Abstract. The asymptotic behaviour of a family of gradient algorithms (including the
methods of steepest descent and minimum residues) for the optimisation of bounded
quadratic operators in $\mathbb{R}^d$ and Hilbert spaces is analyzed. The results obtained
generalize those of Akaike (1959) in several directions. First, all algorithms in the
family are shown to have the same asymptotic behaviour (convergence to a two-point
attractor), which implies in particular that they have similar asymptotic convergence
rates. Second, the analysis also covers the Hilbert space case. A detailed analysis of
the stability property of the attractor is provided.



1. Introduction 

The paper generalizes the results presented in [16] to other optimisation algorithms
of the gradient type. We introduce a class of algorithms, called P-gradient
algorithms, that differ by the choice of the length of the step made in the gradient
direction. The class includes in particular the usual steepest-descent algorithm
and the method of minimal residues of Krasnosel'skii and Krein [9], [10]. We show
that for a quadratic function, the worst asymptotic rate of convergence is the
same for the whole class of algorithms considered. It is also true that, expressed
in the right framework, all the algorithms in the class behave in a very similar
fashion. This analysis complements that presented in [1], [T^IPPI] and Chapter
7 of [15], which concerns steepest descent. Moreover, the analysis in [TB] directly
applies to all algorithms in the class considered, revealing the asymptotic
behaviour for bounded quadratic operators not only in $\mathbb{R}^d$ but also in Hilbert
spaces. The worst-case behaviour exhibited is a fundamental "bottom line" in the
study of optimisation, whose understanding is critical for building more complex
and faster algorithms.

The basic idea is renormalisation, as used throughout [15]. The main result
in the finite-dimensional case is that for any algorithm in the class, in the
renormalised space one observes convergence to a two-point attractor which lies in
the space spanned by the eigenvectors corresponding to the smallest and largest
eigenvalues of the matrix $A$ of the quadratic operator. The proof for bounded
quadratic operators in Hilbert space stems from the proof for $\mathbb{R}^d$ but is
considerably more technical. In both cases, as in [1], the method consists of converting
the problem to one containing a special type of operator on measures on the
spectrum of the operator. The additional technicalities arise from the fact that in the
Hilbert space case the measure, which is associated with the spectral measure
of the operator, may be continuous. Another important result concerns bounds
on convergence rates, named after Kantorovich, see [7]. For all algorithms in the
family considered, the actual asymptotic rate of convergence, although satisfying
Kantorovich bounds, depends on the starting point and is difficult to predict.
This complex behaviour has consequences for the stability of the attractor, which
are discussed following the main results.

The family of gradient algorithms we consider, called P-gradient algorithms,
is introduced in Section 2. Renormalisation is presented there, which, together
with the monotonic sequences of Section 2.4, forms the core of the analysis to
be conducted. The main results are presented in Section 3, first for the case
$\mathcal{H} = \mathbb{R}^d$, then for the Hilbert space case. They rely on the convergence property
of successive transformations of a probability measure, which is presented in
Section 4. Again, the two cases $\mathcal{H} = \mathbb{R}^d$ and $\mathcal{H}$ a Hilbert space are distinguished,
the exposition being much simpler in the former case. The stability of attractors
is discussed in Section 5, only in the more general case of a Hilbert space, the case
$\mathcal{H} = \mathbb{R}^d$ not allowing for a significant simplification of the presentation. Finally,
Section 6 shows the asymptotic equivalence between several rates of convergence
of gradient algorithms. All proofs and some important lemmas are collected in
an appendix.

2. A family of gradient algorithms 

2.1. P-gradient algorithms 

Let $A$ be a real bounded self-adjoint (symmetric) operator in a real Hilbert space
$\mathcal{H}$ with inner product $(x,y)$ and norm given by $\|x\| = (x,x)^{1/2}$. Assume that $A$
is positive, bounded below, and denote its spectral boundaries by $m$ and $M$:

$$ m = \inf_{\|x\|=1} (Ax,x), \qquad M = \sup_{\|x\|=1} (Ax,x), $$

with $0 < m < M < \infty$. The function to be minimized corresponds to the
quadratic form

$$ f(x) = \frac{1}{2}(Ax,x) - (x,y). \tag{1} $$



Asymptotic behaviour of a family of gradient algorithms in $\mathbb{R}^d$ and Hilbert spaces



It is minimum at $x^* = A^{-1}y$; its directional derivative at $x$ in the direction $u$ is

$$ \nabla_u f(x) = (Ax - y, u). $$

The direction of steepest descent at $x$ is $-g$, with $g = g(x)$ the gradient at $x$,
namely $g = Ax - y$. The minimum of $f$ in this direction is obtained for the
optimum step-length

$$ \frac{(g,g)}{(Ag,g)}, $$

which corresponds to the usual steepest-descent algorithm. One iteration of the
steepest-descent algorithm is thus

$$ x_{k+1} = x_k - \frac{(g_k,g_k)}{(Ag_k,g_k)}\, g_k, \tag{2} $$

with $g_k = Ax_k - y$ and $x_0$ some initial element in $\mathcal{H}$. We define more generally
the following class of algorithms.

Definition 1. Let $P(\cdot)$ be a real function defined on $[m,M]$, infinitely
differentiable, with Laurent series

$$ P(z) = \sum_{k=-\infty}^{\infty} c_k z^k, \qquad c_k \in \mathbb{R} \text{ for all } k, $$

such that $0 < \sum_{k=-\infty}^{\infty} c_k a^k < \infty$ for $a \in [m,M]$. The $k$-th iteration of a P-gradient
algorithm is defined by

$$ x_{k+1} = x_k - \gamma_k g_k, \tag{3} $$

where the step-length $\gamma_k$ minimizes $(P(A)g_{k+1}, g_{k+1})$ with respect to $\gamma$, with
$g_{k+1} = g(x_{k+1}) = g(x_k - \gamma g_k)$.



Direct calculation gives

$$ \gamma_k = \frac{(P(A)Ag_k, g_k)}{(P(A)A^2 g_k, g_k)}. \tag{4} $$

Note that $AP(A) = P(A)A$ and that the denominator and numerator of $\gamma_k$ are
linear in $P(A)$. Also, $\gamma_k$ is scale-invariant in $P(A)$ and $\gamma_k \in [1/M, 1/m]$.

Taking $P(A) = A^{-1}$ gives the steepest-descent algorithm. Choosing $P(A) = I$,
the identity operator, is equivalent to choosing the step-length that minimizes
the norm of the gradient $g_{k+1}$ at the next point. We then obtain the
method of minimal residues introduced in [10] for the solution of linear
equations. For any fixed $\alpha \in (0,1)$, choosing $\gamma_k$ that minimizes
$\alpha f(x_k - \gamma g_k) + (1-\alpha)(g(x_k - \gamma g_k), g(x_k - \gamma g_k))$ with respect to $\gamma$ also gives an algorithm in the
family. More generally, we show below how to construct P-gradient algorithms,
with $P(\cdot)$ a polynomial in $A$, using evaluations of $f(\cdot)$ and $g(\cdot)$ only.
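As an illustration of the step-length formula of this section, the following minimal Python sketch evaluates the ratio $(P(A)Ag, g)/(P(A)A^2g, g)$ for a diagonal matrix $A$. The spectrum, the gradient vector and the function name `p_gradient_step` are hypothetical choices made for the example; the weight $\alpha/(2\lambda) + (1-\alpha)$ used for the mixed criterion relies on the identity $f(x) - f(x^*) = \frac{1}{2}(A^{-1}g, g)$.

```python
def p_gradient_step(lam, g, P):
    """Step length (P(A) A g, g) / (P(A) A^2 g, g) for a diagonal A.

    lam holds the eigenvalues of A, g the gradient in the eigenbasis, and
    P is a scalar function applied to each eigenvalue (so P(A) is diagonal).
    """
    num = sum(P(l) * l * gi * gi for l, gi in zip(lam, g))
    den = sum(P(l) * l * l * gi * gi for l, gi in zip(lam, g))
    return num / den


lam = [1.0, 4.0, 10.0]   # hypothetical spectrum: m = 1, M = 10
g = [0.5, -1.0, 2.0]     # hypothetical gradient g_k

gamma_sd = p_gradient_step(lam, g, lambda l: 1.0 / l)  # P(A) = A^{-1}: steepest descent
gamma_mr = p_gradient_step(lam, g, lambda l: 1.0)      # P(A) = I: minimal residues
alpha = 0.3                                            # hypothetical mixing weight
gamma_mix = p_gradient_step(lam, g, lambda l: alpha / (2.0 * l) + (1.0 - alpha))
```

In each case the returned value should lie in $[1/M, 1/m]$, and for $P(A) = A^{-1}$ it reduces to $(g,g)/(Ag,g)$.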
2.2. Practical construction when P is a polynomial



We consider the case where $P(A) = A^q$ for some integer $q \geq -1$. (As
mentioned, the cases $q = -1$ and $q = 0$ respectively correspond to the methods of
steepest descent and minimal residues.) The extension to $P(\cdot)$ polynomial in $A$
is straightforward (including also linear combinations with $A^{-1}$), using the
linearity of the numerator and denominator of $\gamma_k$ in $P(A)$.

The minimisation of $(P(A)g_{k+1}, g_{k+1})$, or the calculation of $\gamma_k$ in (4), requires
the calculation of terms of the form $(A^n g, g)$, with $n = q$ or $n = q+1, q+2$.
As shown below, they are easily obtained from evaluations of $g(\cdot)$ at different
points. Notice that this construction implies that one iteration of the algorithm
will require several evaluations of $g(\cdot)$. The construction proposed below is not
necessarily the most economical one, and evaluations of $f(\cdot)$ and $g(\cdot)$ at different
points could be combined to provide more efficient evaluations of terms $(A^n g, g)$.
Our objective here is simply to show that the family of algorithms considered in
the paper is not of purely theoretical interest, and that algorithms other than
steepest descent and minimal residues could also be considered in practice.

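Let $(A^n g, g)$ be the term to be evaluated, $n \geq 1$, with $g = g(x)$ the gradient
at the current point $x$. Define $x^{(0)} = x$ and

$$ x^{(i+1)} = x^{(i)} - \beta\, g(x^{(i)}), \qquad i = 0, 1, \ldots, $$

with $\beta$ a fixed positive number (for instance, $\beta$ can be taken equal to the value
of $\gamma$ at the previous iteration of the algorithm). We obtain

$$ g^{(i)} = g(x^{(i)}) = (I - \beta A)^i g. $$

Define $P_i = (g, g^{(i)}) = (g, (I - \beta A)^i g)$. In matrix notation, $\mathbf{P}_n = \mathbf{Q}_n \mathbf{G}_n$, where

$$ \mathbf{P}_n = (P_0, P_1, \ldots, P_n)^T, \qquad \mathbf{G}_n = ((g,g), (Ag,g), \ldots, (A^n g, g))^T $$

and the entries of the $(n+1) \times (n+1)$ matrix $\mathbf{Q}_n$ are the binomial coefficients
multiplied by the corresponding powers of $-\beta$,

$$ \mathbf{Q}_n = \begin{pmatrix}
1 & & & & \\
1 & -\beta & & & \\
1 & -2\beta & \beta^2 & & \\
1 & -3\beta & 3\beta^2 & -\beta^3 & \\
\vdots & & & & \ddots
\end{pmatrix}. $$

The value of $(A^n g, g)$ is then directly obtained from $\mathbf{G}_n = \mathbf{Q}_n^{-1} \mathbf{P}_n$. The entries
of $\mathbf{P}_n$, defined by $P_i = (g, g^{(i)})$, are also obtained more economically from

$$ P_{i+j} = (g^{(i)}, g^{(j)}), $$

which holds because $I - \beta A$ is self-adjoint.
Therefore, the evaluation of $\gamma_k = (P(A)Ag_k, g_k)/(P(A)A^2 g_k, g_k)$, with $P(\cdot)$ a
polynomial of degree $q$, requires $\lceil q/2 \rceil + 2$ gradient evaluations (including the
one at $x^{(0)} = x_k$).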
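The construction of this section can be sketched in Python as follows, for a diagonal $A$ so that the result can be checked against a direct computation. This is only an illustration under stated assumptions: the function names, the test data and the value of $\beta$ are hypothetical, the products $(g, (I-\beta A)^{\ell} g)$ are formed as $(g^{(i)}, g^{(j)})$ with $i + j = \ell$ (valid because $I - \beta A$ is symmetric), and the triangular system is solved by forward substitution rather than by forming $\mathbf{Q}_n^{-1}$.

```python
from math import comb


def gradient(lam, y, x):
    """g(x) = A x - y for a diagonal A with eigenvalues lam."""
    return [l * xi - yi for l, xi, yi in zip(lam, x, y)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def moments_from_gradients(lam, y, x, n, beta):
    """Return [(g,g), (Ag,g), ..., (A^n g,g)] using only calls to gradient()."""
    xs = list(x)
    grads = [gradient(lam, y, xs)]            # g^(0) = g
    for _ in range((n + 1) // 2):             # ceil(n/2) further evaluations
        xs = [xi - beta * gi for xi, gi in zip(xs, grads[-1])]
        grads.append(gradient(lam, y, xs))    # g^(i+1) = (I - beta A) g^(i)
    # P_l = (g, (I - beta A)^l g) = (g^(i), g^(j)) with i + j = l
    P = [dot(grads[l // 2], grads[l - l // 2]) for l in range(n + 1)]
    # Forward substitution on P = Q G, with Q[i][j] = comb(i, j) * (-beta)**j
    G = []
    for i in range(n + 1):
        s = sum(comb(i, j) * (-beta) ** j * G[j] for j in range(i))
        G.append((P[i] - s) / ((-beta) ** i))
    return G


lam = [1.0, 4.0, 10.0]       # hypothetical spectrum
y = [1.0, 2.0, 3.0]          # hypothetical right-hand side
x = [0.3, -0.2, 0.5]         # hypothetical current point
G = moments_from_gradients(lam, y, x, n=3, beta=0.1)
g = gradient(lam, y, x)
direct = [sum(l ** j * gi * gi for l, gi in zip(lam, g)) for j in range(4)]
```

The lists `G` and `direct` should agree up to rounding. Dividing by powers of a small $\beta$ amplifies rounding errors, which is one reason this construction need not be the most economical or the most accurate one.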
2.3. Renormalisation

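We can rewrite the iteration (3) as

$$ (x_{k+1} - x^*) = (x_k - x^*) - \gamma_k g_k, $$

with $g_k = g(x_k) = A(x_k - x^*)$, so that

$$ g_{k+1} = g_k - \gamma_k A g_k = g_k - \frac{(P(A)Ag_k, g_k)}{(P(A)A^2 g_k, g_k)}\, A g_k. $$

Define the renormalised variable

$$ z(x) = \frac{B g(x)}{(P(A)A g(x), g(x))^{1/2}}, \tag{5} $$

with $B = [P(A)A]^{1/2}$, the positive square-root of $P(A)A$, so that $(z(x), z(x)) = 1$.
Also define $z_k = z(x_k)$,

$$ \mu_j^k = (A^j z_k, z_k), \qquad j \in \mathbb{Z}, \tag{6} $$

so that $\mu_0^k = 1$ for any $k$ and $\gamma_k = \mu_0^k/\mu_1^k = 1/\mu_1^k$. We obtain

$$ z_{k+1} = \frac{(I - \gamma_k A) B g_k}{(P(A)A g_{k+1}, g_{k+1})^{1/2}}
 = \frac{(I - \gamma_k A) B g_k}{((I - \gamma_k A) B g_k, (I - \gamma_k A) B g_k)^{1/2}}
 = \frac{(I - \gamma_k A) z_k}{((I - \gamma_k A) z_k, (I - \gamma_k A) z_k)^{1/2}}
 = \frac{(I - \gamma_k A) z_k}{(1 - 2\gamma_k \mu_1^k + \gamma_k^2 \mu_2^k)^{1/2}}, $$

that is,

$$ z_{k+1} = \frac{(I - A/\mu_1^k)\, z_k}{\left(\mu_2^k/(\mu_1^k)^2 - 1\right)^{1/2}}. \tag{7} $$

This gives the updating formula for the moments

$$ \mu_j^{k+1} = (A^j z_{k+1}, z_{k+1}) = \frac{\mu_j^k - 2\mu_{j+1}^k/\mu_1^k + \mu_{j+2}^k/(\mu_1^k)^2}{\mu_2^k/(\mu_1^k)^2 - 1}. \tag{8} $$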

In the special case where $\mathcal{H} = \mathbb{R}^d$ we can assume that $A$ is already
diagonalised, with eigenvalues $0 < \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_d$. We can then consider
$[z_k]_i^2$, with $[z_k]_i$ the $i$-th component of $z_k$, as a mass on the eigenvalue $\lambda_i$, with
$\sum_{i=1}^d [z_k]_i^2 = \mu_0^k = 1$. Define the discrete probability measure $\nu_k$ supported on
$(\lambda_1, \ldots, \lambda_d)$ by $\nu_k(\lambda_i) = [z_k]_i^2$, so that its $j$-th moment is $\mu_j^k$, $j \in \mathbb{Z}$. We can
then interpret (7) as a transformation $\nu_k \to \nu_{k+1}$. The asymptotic behaviour of
the sequence $(z_k)$ generated by (7) was studied in [1], see also [5] and Chapter
7 of [15]. The main result is that, assuming $0 < \lambda_1 < \lambda_2 \leq \cdots \leq \lambda_{d-1} < \lambda_d$,
the sequence $(z_k)$ converges to a two-dimensional plane, spanned by the
eigenvectors $e_1$, $e_d$ associated with $\lambda_1$ and $\lambda_d$. The attraction property is stated more
precisely in Section 3, also in the Hilbert space case. It is already important
to notice that although the results in the references above were obtained for
the steepest-descent algorithm, the renormalisation (5), which depends on the
chosen $P(\cdot)$, makes them applicable to any algorithm in the family considered.
Also, using the renormalisation just defined we easily obtain (non-asymptotic)
results on the monotonicity of the algorithm along its trajectory.
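The renormalised iteration and the moment update of this section can be checked numerically with a short Python sketch for a diagonal $A$. The spectrum and the starting vector are hypothetical, and the function names are chosen only for this example; the moment update is written in the equivalent form obtained by multiplying numerator and denominator by $(\mu_1^k)^2$.

```python
from math import sqrt


def moment(lam, z, j):
    """mu_j = (A^j z, z) for a diagonal A with eigenvalues lam."""
    return sum(l ** j * zi * zi for l, zi in zip(lam, z))


def renormalised_step(lam, z):
    """One renormalised step: z -> (I - A/mu_1) z / (mu_2/mu_1^2 - 1)^(1/2)."""
    mu1, mu2 = moment(lam, z, 1), moment(lam, z, 2)
    scale = sqrt(mu2 / mu1 ** 2 - 1.0)
    return [(1.0 - l / mu1) * zi / scale for l, zi in zip(lam, z)]


def moment_update(lam, z, j):
    """Moment mu_j of the next iterate, computed from the moments of z only."""
    mu = lambda i: moment(lam, z, i)
    return (mu(j + 2) - 2.0 * mu(1) * mu(j + 1) + mu(1) ** 2 * mu(j)) / (mu(2) - mu(1) ** 2)


lam = [1.0, 4.0, 10.0]              # hypothetical spectrum
z0 = [1.0 / sqrt(3.0)] * 3          # hypothetical unit starting vector
z1 = renormalised_step(lam, z0)
```

After one step, `moment(lam, z1, 0)` should equal 1 and `moment(lam, z1, j)` should match `moment_update(lam, z0, j)` for any integer `j`, up to rounding.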
2.4. Monotonicity of a rate of convergence

Consider the function $(P(A)g_{k+1}, g_{k+1})$ that $\gamma_k$ minimizes, and compute the rate
of convergence $r_k$ of the algorithm at iteration $k$, defined by

$$ r_k = \frac{(P(A)g_{k+1}, g_{k+1})}{(P(A)g_k, g_k)}. \tag{9} $$

Other rates of convergence will be considered in Section 6, where they will
be shown to be asymptotically equivalent to $r_k$. Direct calculation gives
$r_k = 1 - 1/L_k$, with

$$ L_k = \mu_1^k \mu_{-1}^k, $$

where the moments $\mu_j^k$ are defined by (6). Also, from (8), $L_k$ satisfies

$$ L_{k+1} - L_k = \frac{\mu_1^k}{D_k^2} \det \mathbf{M}_k $$

with

$$ D_k = \mu_2^k - (\mu_1^k)^2 $$

and

$$ \mathbf{M}_k = \begin{pmatrix} \mu_{-1}^k & \mu_0^k & \mu_1^k \\ \mu_0^k & \mu_1^k & \mu_2^k \\ \mu_1^k & \mu_2^k & \mu_3^k \end{pmatrix}. \tag{10} $$

The moment matrix $\mathbf{M}_k$ is positive semi-definite so that $\det \mathbf{M}_k \geq 0$, and thus
$L_{k+1} \geq L_k$, that is, both $L_k$ and the rate $r_k$ are non-decreasing along the
trajectory followed by the algorithm. When $\mathcal{H} = \mathbb{R}^2$ ($d = 2$), $\det \mathbf{M}_k = 0$ and
$r_k$ is constant. When $d > 2$ or $\mathcal{H}$ is a Hilbert space, the rate is monotonically
increasing for a typical $x_0$, indeed, for almost all $z_0 = z(x_0)$ with respect to the
uniform measure on the unit sphere when $\mathcal{H} = \mathbb{R}^d$. Notice that if the rate is
constant over two iterations ($\det \mathbf{M}_k = 0$), then the measure $\nu_k$ is supported on
two points only, and the iteration (7) for the masses shows that this situation
will continue: the rate will thus remain constant for all subsequent iterations.

Note that $L_k$ and $D_k$ are bounded (since $\nu_k$ has a bounded support),
respectively by $L^*$ and $D^*$, with $L^* = (M+m)^2/(4mM)$ and $D^* = (M-m)^2/4$, see
Lemma 1 in Appendix A3. Therefore, since $L_k$ is non-decreasing it converges to
some limit, and

$$ \det \mathbf{M}_k = \frac{D_k^2 (L_{k+1} - L_k)}{\mu_1^k} \leq \frac{(D^*)^2 (L_{k+1} - L_k)}{m} \to 0, \qquad k \to \infty. \tag{11} $$

In addition to $L_k$ and $r_k$ another quantity also turns out to be non-decreasing
along the trajectory. Consider

$$ \frac{(P(A)Ag_{k+1}, g_{k+1})}{(P(A)A(x_{k+1} - x_k), (x_{k+1} - x_k))} = \frac{(P(A)Ag_{k+1}, g_{k+1})}{\gamma_k^2 (P(A)Ag_k, g_k)} = \mu_2^k - (\mu_1^k)^2 = D_k. \tag{12} $$

Direct calculation using (7) gives

$$ D_{k+1} - D_k = \frac{1}{D_k^2} \det \mathbf{N}_k $$

with

$$ \mathbf{N}_k = \begin{pmatrix} \mu_0^k & \mu_1^k & \mu_2^k \\ \mu_1^k & \mu_2^k & \mu_3^k \\ \mu_2^k & \mu_3^k & \mu_4^k \end{pmatrix}. $$

Again, $\mathbf{N}_k$ is positive semi-definite and $\det \mathbf{N}_k \geq 0$, so that $D_k$ is also
non-decreasing. It converges to some limit and $\det \mathbf{N}_k$ converges to zero for the same
reasons as above.

Substitution of $P(A)$ for a particular algorithm shows which quantities are
monotonic. For the steepest-descent algorithm, $P(A) = A^{-1}$, $(A^{-1}g_k, g_k) =
2[f(x_k) - f(x^*)]$, and thus the ratios $r_k = [f(x_{k+1}) - f(x^*)]/[f(x_k) - f(x^*)]$ and
$D_k = (g_{k+1}, g_{k+1})/((x_{k+1} - x_k), (x_{k+1} - x_k))$ are monotonically non-decreasing.
For the method of minimal residues, $P(A) = I$, and the ratios $r_k = (g_{k+1}, g_{k+1})/(g_k, g_k)$
and $D_k = (Ag_{k+1}, g_{k+1})/(A(x_{k+1} - x_k), (x_{k+1} - x_k))$ are monotonically
non-decreasing.

The monotonicity and boundedness of $L_k$ and $D_k$ make them suitable for
studying the asymptotic behaviour of the algorithm. This is developed in the
next section.
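The monotonicity of $L_k$ and $D_k$ can be observed numerically. The Python sketch below iterates the masses $[z_k]_i^2$ for a diagonal $A$ and records both sequences; the spectrum and the initial masses are hypothetical, and the masses are rescaled to sum to one at every step, an implementation choice made here only to keep rounding errors from accumulating.

```python
def moment(lam, w, j):
    """mu_j = sum_i lam_i^j w_i for masses w_i = [z]_i^2 on the eigenvalues."""
    return sum(l ** j * wi for l, wi in zip(lam, w))


def mass_step(lam, w):
    """One step of the renormalised iteration, written for the masses."""
    mu1 = moment(lam, w, 1)
    D = moment(lam, w, 2) - mu1 ** 2
    new = [(l - mu1) ** 2 * wi / D for l, wi in zip(lam, w)]
    total = sum(new)                     # equals 1 in exact arithmetic
    return [v / total for v in new]      # guard against rounding drift


lam = [1.0, 2.0, 4.0, 7.0, 10.0]         # hypothetical spectrum: m = 1, M = 10
w = [0.2] * 5                            # hypothetical initial masses
m, M = lam[0], lam[-1]
L_star = (M + m) ** 2 / (4.0 * m * M)    # upper bound on L_k
D_star = (M - m) ** 2 / 4.0              # upper bound on D_k

L_seq, D_seq = [], []
for _ in range(40):
    mu1 = moment(lam, w, 1)
    L_seq.append(mu1 * moment(lam, w, -1))        # L_k = mu_1 mu_{-1}
    D_seq.append(moment(lam, w, 2) - mu1 ** 2)    # D_k = mu_2 - mu_1^2
    w = mass_step(lam, w)
```

Both `L_seq` and `D_seq` should be non-decreasing and stay below `L_star` and `D_star` respectively, up to rounding.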

3. Asymptotic behaviour of gradient algorithms 

Consider the case $\mathcal{H} = \mathbb{R}^d$, and assume that the minimal and maximal
eigenvalues of $A$, $\lambda_1 = m$, $\lambda_d = M$, are simple. The attraction property can be stated as
follows. Choose $z_0 = z(x_0)$, the renormalised variable defined by (5) at the initial
point $x_0$, such that $(z_0, e_1) > 0$, $(z_0, e_d) > 0$, with $e_1$ and $e_d$ the eigenvectors
associated with $\lambda_1$ and $\lambda_d$ respectively. Then

$$ z_{2k} \to \sqrt{p}\, e_1 + \sqrt{1-p}\, e_d, \qquad z_{2k+1} \to \sqrt{1-p}\, e_1 - \sqrt{p}\, e_d \qquad \text{when } k \to \infty, $$

where $p$ is some number in $(0,1)$, see Section 5 concerning the range of possible
values for $p$. This property, stated in a more general framework in Theorem 1
below, has important consequences for the asymptotic rate of convergence of
the algorithm, see Section 6. The proof of the attraction property relies on the
convergence of successive transformations of the probability measures $\nu_k$ defined
by $[z_k]_i^2$. The approaches used in [1], [5] to study this convergence do not apply
when $\mathcal{H}$ is infinite dimensional, and we shall present a more general proof in
Section 4. It differs somewhat from the one in Chapter 7 of [15], in particular in
the choice of the monotonic sequence, $(L_k)$ instead of $(D_k)$.

The attraction theorem in $\mathbb{R}^d$ can be stated as follows. We can assume that
$A$ is diagonalised, and the probability measure $\nu_k$ is then discrete and puts
mass $[z_k]_i^2$ at the eigenvalue $\lambda_i$. Notice that the updating rule (7) is identical for
$[z_k]_i$ and $[z_k]_j$ associated with $\lambda_i = \lambda_j$, and the corresponding masses can thus
be summed. We can therefore assume that all eigenvalues are different when
studying the evolution of $\nu_k$, see Theorem 3.
Theorem 1. Let $A$ be a $d \times d$ symmetric matrix, positive definite, with minimum
and maximum eigenvalues $m$ and $M$ such that $0 < m < M < \infty$ and apply a
P-gradient algorithm, see Definition 1, for the minimisation of $f(x)$ given by
(1), initialized at $x_0$, with $z_0 = z(x_0)$, see (5). Assume that

$$ E_1 z_0 \neq 0 \quad \text{and} \quad E_d z_0 \neq 0, \tag{13} $$

where $E_1$ and $E_d$ are the orthogonal projectors on the eigenspaces respectively
associated with $\lambda_1 = m$ and $\lambda_d = M$. Then the asymptotic behaviour of the
renormalised gradient $z_k = z(x_k)$ is such that

$$ z_{2k} = \sqrt{p}\, u_{2k} + \sqrt{1-p}\, v_{2k}, \qquad z_{2k+1} = \sqrt{1-p}\, u_{2k+1} - \sqrt{p}\, v_{2k+1}, $$

with $\|u_n\| = \|v_n\| = 1$ $\forall n$, $\|Au_n - m u_n\| \to 0$, $\|Av_n - M v_n\| \to 0$ as $n \to \infty$,
and $p$ some number in $(0,1)$, depending on $z_0$.

The proof is omitted since we prove later a more general property valid for
$\mathcal{H}$ a Hilbert space. A more precise result is obtained when the eigenvalues $\lambda_1$
and $\lambda_d$ are simple: the vector $z_k$ converges to the two-dimensional plane defined
by the eigenvectors $e_1$ and $e_d$ associated with $\lambda_1$ and $\lambda_d$.
Corollary 1. Let $A$ be a positive-definite symmetric matrix with ordered
eigenvalues

$$ 0 < m = \lambda_1 < \lambda_2 \leq \cdots \leq \lambda_{d-1} < \lambda_d = M $$

and let $e_1$, $e_d$ be the eigenvectors associated with $\lambda_1$ and $\lambda_d$ respectively. Apply
a P-gradient algorithm, see Definition 1, for the minimisation of $f(x)$ given by
(1), initialized at $x_0$ such that $z_0^T e_1 \neq 0$ and $z_0^T e_d \neq 0$, with $z_0 = z(x_0)$, see (5).
Then the algorithm attracts to the plane $\Pi$ spanned by $e_1$ and $e_d$ in the following
sense:

$$ w^T z_k \to 0, \qquad k \to \infty, $$

for any nonzero vector $w \in \Pi^\perp$. Moreover, the sequence $(z_k)$ converges to a
two-point cycle.

This corollary is a straightforward consequence of Theorem 1: when $\lambda_1$ and $\lambda_d$
are simple, with associated eigenvectors $e_1$ and $e_d$, $u_n$ and $v_n$ then respectively
tend to $e_1$ and $e_d$. The result easily generalizes to the case when (13) is not
satisfied. The algorithm then attracts to a two-dimensional plane defined by the
eigenvectors $e_i$ and $e_j$ associated with the smallest and largest eigenvalues such
that $z_0^T e_i \neq 0$ and $z_0^T e_j \neq 0$.

We now state the attraction theorem in the more general case where $\mathcal{H}$ is a
Hilbert space. The proof is given in Appendix A1.

Theorem 2. Let $A$ be a bounded real symmetric operator in a Hilbert space $\mathcal{H}$,
positive, with bounds $m$ and $M$, such that $0 < m < M < \infty$ and apply a
P-gradient algorithm, see Definition 1, for the minimisation of $f(x)$ given by (1),
initialized at $x_0$, with $z_0 = z(x_0)$, see (5). Assume that $z_0$ is such that for any
$\epsilon$, $0 < \epsilon < (M - m)/2$,

$$ (E_{m+\epsilon} z_0, z_0) > 0 \quad \text{and} \quad (E_{M-\epsilon} z_0, z_0) < 1, \tag{14} $$

with $(E_\lambda)$ the spectral family of projections associated with $A$. The asymptotic
behaviour of the renormalised gradient $z_k = z(x_k)$ is such that

$$ z_{2k} = \sqrt{p}\, u_{2k} + \sqrt{1-p}\, v_{2k}, \qquad z_{2k+1} = \sqrt{1-p}\, u_{2k+1} - \sqrt{p}\, v_{2k+1}, \tag{15} $$

with $\|u_n\| = \|v_n\| = 1$ $\forall n$, $\|Au_n - m u_n\| \to 0$, $\|Av_n - M v_n\| \to 0$ as $n \to \infty$,
and $p$ some number in $(0,1)$, depending on $z_0$.



4. A property of successive transformations of a probability measure 

The two properties established in this section form the cornerstones of the proofs
of the theorems of the previous section. We consider first the case of a discrete
measure with finite support, which in terms of convergence of a P-gradient
algorithm corresponds to the case $\mathcal{H} = \mathbb{R}^d$. The proof is given in Appendix A2.

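Theorem 3. Let $\nu_0$ be a discrete probability measure on $\{\lambda_1, \ldots, \lambda_d\}$ with

$$ 0 < m = \lambda_1 < \lambda_2 < \cdots < \lambda_{d-1} < \lambda_d = M < \infty. $$

Let $[z_k]_i^2$ denote the mass placed at $\lambda_i$ by $\nu_k$, that is, $\nu_k(\lambda_i) = [z_k]_i^2$. Consider
the transformation $T : \nu_k \to \nu_{k+1}$ defined by

$$ [z_{k+1}]_i^2 = \frac{(\lambda_i - \mu_1^k)^2}{\mu_2^k - (\mu_1^k)^2}\, [z_k]_i^2, \tag{16} $$

with the moments $\mu_j^k$ defined by (6). Then, when $k \to \infty$,

$$ [z_{2k}]_1^2 \to p, \quad [z_{2k+1}]_1^2 \to 1-p \quad \text{and} \quad [z_{2k}]_d^2 \to 1-p, \quad [z_{2k+1}]_d^2 \to p \tag{17} $$

for some $p$ depending on $\nu_0$, $0 < p < 1$. Furthermore,

$$ p = \frac{1}{2} \pm \frac{\rho+1}{\rho-1} \sqrt{\frac{1}{4} - \frac{\rho L}{(\rho+1)^2}}, $$

with $\rho = M/m$ and $L = \lim_{k\to\infty} L_k$.

Note that the limiting value $L$ depends on $\nu_0$, so that the value of $p$ that
characterizes the attractor is difficult to predict. The range of possible values for
$p$ is discussed in Section 5.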
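The relation between the limit $L$ and $p$ follows from evaluating $L = \mu_1 \mu_{-1}$ on a two-point measure with masses $p$ at $m$ and $1-p$ at $M$, which gives $L = 1 + p(1-p)(\rho-1)^2/\rho$. The Python sketch below inverts this relation and checks it on a hypothetical two-point measure; the function names are chosen for this example only.

```python
from math import sqrt


def p_from_L(L, rho):
    """Both candidate values of p for a given limit L and ratio rho = M/m."""
    # Rounding may make the argument slightly negative when L is at its
    # maximum (p = 1/2); this sketch does not guard against that case.
    root = sqrt(0.25 - rho * L / (rho + 1.0) ** 2)
    half_width = (rho + 1.0) / (rho - 1.0) * root
    return 0.5 - half_width, 0.5 + half_width


def L_two_point(p, m, M):
    """L = mu_1 mu_{-1} for masses p at m and 1 - p at M."""
    return (p * m + (1.0 - p) * M) * (p / m + (1.0 - p) / M)


m, M, p = 1.0, 10.0, 0.4            # hypothetical values
p_low, p_high = p_from_L(L_two_point(p, m, M), M / m)
```

The two roots are symmetric about $1/2$, which reflects the alternation between $p$ and $1-p$ at even and odd iterations.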

We now consider the case of an arbitrary measure on an interval, which raises
some additional difficulties compared to the previous case. In terms of convergence
of a P-gradient algorithm, it corresponds to the case where $\mathcal{H}$ is a Hilbert space:
for $E_\lambda$ the spectral family associated with the operator $A$, we define the measure
$\nu_k$ by $\nu_k(d\lambda) = d(E_\lambda z_k, z_k)$, $m \leq \lambda \leq M$.
Theorem 4. Let $\nu_0$ be a probability measure on the family $\mathcal{B}$ of Borel sets of
$(0,\infty)$, with support $[m, M]$, so that

$$ m = \operatorname{ess\,inf}(\nu_0) = \sup\{a : \nu_0\{x : x < a\} = 0\}, $$
$$ M = \operatorname{ess\,sup}(\nu_0) = \inf\{a : \nu_0\{x : x > a\} = 0\}. $$

Assume that $0 < m < M < \infty$. Consider the transformation $T : \nu_k \to \nu_{k+1}$
defined by

$$ \nu_{k+1}(\mathcal{A}) = \int_{\mathcal{A}} \frac{(\lambda - \mu_1^k)^2}{D_k}\, \nu_k(d\lambda) \tag{18} $$

for any $\mathcal{A} \in \mathcal{B}$, where $\mu_1^k = \int \lambda\, \nu_k(d\lambda)$ and $D_k = \mu_2^k - (\mu_1^k)^2$, with
$\mu_2^k = \int \lambda^2\, \nu_k(d\lambda)$. Then, as $k \to \infty$,

$$ \nu_{2k}(I) \to p, \qquad \nu_{2k+1}(I) \to 1-p \tag{19} $$

for all $I = [m, x)$, $m < x < M$, for some $p$ depending on $\nu_0$, $0 < p < 1$.

The proof of Theorem 4 is given in Appendix A3.

5. Stability of attractors 

The range of possible values for $p$ in the attraction Theorem 1 ($\mathcal{H} = \mathbb{R}^d$) is
considered in Theorem 3 of [1] (see also Lemma 3.5 of [13]). Let $s(\lambda)$ and $\lambda^*$
be defined by (20). This theorem states that when $\lambda^*$ is not discarded at any
iteration, that is, when $\mu_1^k \neq \lambda^*$ for any $k$, then $p \in [1/2 - s(\lambda^*), 1/2 + s(\lambda^*)]$
(note that this assumption cannot be checked). In this section we extend this
result in two directions: (i) we will assume that $\mathcal{H}$ is a Hilbert space, (ii) we
study the stability of the attractor defined by $p$ in Theorem 5. We shall use the
following definition of stability, see [B] p. 444, [11], p. 7.

Definition 2. A fixed point $\nu^*$ for a mapping $T(\cdot)$ on a metric space with
distance $d(\cdot,\cdot)$ will be called stable if $\forall \epsilon > 0$, $\exists \alpha > 0$ such that for any $\nu_0$ for which
$d(\nu_0, \nu^*) < \alpha$, $d(T^n(\nu_0), \nu^*) < \epsilon$ for all $n > 0$. A fixed point $\nu^*$ is unstable if it
is not stable.

We shall use the distance $d(\nu, \nu')$ given by the Lévy-Prokhorov metric, see
[213] p. 349. In our case (measures supported on $[m,M]$), $d(\nu,\nu')$ becomes the
Lévy distance between the distribution functions $F$, $F'$ associated with $\nu$, $\nu'$,
which we denote

$$ L(F, F') = \inf\{\epsilon : F'(x - \epsilon) - \epsilon \leq F(x) \leq F'(x + \epsilon) + \epsilon, \ \forall x\}. $$

In the case where one of the two measures is the discrete measure $\nu_p^*$ concentrated
on $m$, $M$, with $\nu_p^*(m) = p$, $\nu_p^*(M) = 1 - p$, we get

$$ d(\nu, \nu_p^*) = L(F, F_p^*) = \inf\{\epsilon : F(x) \leq p + \epsilon \text{ for } x < M - \epsilon \text{ and } p - \epsilon \leq F(x) \text{ for } m + \epsilon \leq x\}, $$

with $F_p^*$ the distribution function associated with $\nu_p^*$. We have then proved the
following, see Appendix A4.
Theorem 5. Consider the situation of Theorem 4 with $\nu_0$ any probability
measure supported on some closed subset $S_A$ of $[m, M]$ and

$$ \operatorname{ess\,inf}(\nu_0) = m, \qquad \operatorname{ess\,sup}(\nu_0) = M. $$

(i) The measure $\nu_p^*$ is a fixed point for the mapping $T^2$.

(ii) Consider the set $\mathcal{I}_u$ defined by

$$ \mathcal{I}_u = \left(0, \tfrac{1}{2} - s(\lambda^*)\right) \cup \left(\tfrac{1}{2} + s(\lambda^*), 1\right), $$

where

$$ s(\lambda) = \frac{\sqrt{(M - \lambda)^2 + (\lambda - m)^2}}{2(M - m)}, \qquad \lambda^* = \arg\min_{\lambda \in S_A} s(\lambda). \tag{20} $$

Any $\nu_p^*$ with $p$ in $\mathcal{I}_u$ corresponds to an unstable fixed point for $T^2$.

(iii) Any point in the interval

$$ \mathcal{I}_s = \left(\tfrac{1}{2} - s(\lambda^*), \tfrac{1}{2} + s(\lambda^*)\right) \tag{21} $$

corresponds to a stable $\nu_p^*$ for the mapping $T^2$.

Remark 1. The convergence $d(\nu_k, \nu_p^*) \to 0$ is equivalent to weak convergence
$\nu_k \to \nu_p^*$ in the usual sense. If $z_k$ is associated with the spectral measure $\nu_k$
and $z^*$ with $\nu_p^*$, then, in the Hilbert space this is equivalent to $(z_k - z^*, y) \to 0$
for any $y \in \mathcal{H}$, whereas strong convergence would require $\|z_k - z^*\| \to 0$. For
$\mathbb{R}^d$, the two types of convergence are equivalent, and thus Corollary 1 implies
strong convergence. However, for $\mathcal{H}$ a Hilbert space the equivalence is false, and
indeed strong convergence generally does not hold. The stability property (iii) is
thus a weak statement when $\mathcal{H}$ is a Hilbert space. The $L_2$ metric in $\mathcal{H}$ induces
the Hellinger metric on the space of spectral measures, which defines the same
topology as the distance in variation, see [5D], p. 364. Strong convergence in $\mathcal{H}$
is thus related to distance in variation in the space of spectral measures and is
clearly difficult to obtain, except in the special situation where $\nu_0$ has positive
mass at $\{m\}$ and $\{M\}$ and presents a spectral gap: $\nu_0[(m, m+\epsilon)] = 0$ and
$\nu_0[(M - \epsilon, M)] = 0$ for some $\epsilon > 0$.

We have $\nu_{k+1}(d\lambda) = H(\nu_k, \lambda)\, \nu_k(d\lambda)$, with $H(\nu_k, \lambda)$ given in
Appendix A4. One may then notice that when $\nu_0$ is a discrete probability measure,
the condition $H(\nu_p^*, \lambda) > 1$ used in the proof of the instability part of the
theorem, see Appendix A4, corresponds to a condition on the eigenvalues of the
Jacobian of the transformation $T^2$, see [15].

Note that the stability interval $\mathcal{I}_s$ always contains the interval

$$ \left(\frac{1}{2} - \frac{1}{2\sqrt{2}},\ \frac{1}{2} + \frac{1}{2\sqrt{2}}\right) \simeq (0.14645,\ 0.85355). $$
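For a finite spectrum, the function $s(\lambda)$ and the stability interval are straightforward to compute. The Python sketch below is an illustration only: the function names and the three-point spectrum are hypothetical, and the minimum of $s$ is taken over the listed eigenvalues, playing the role of the set $S_A$.

```python
from math import sqrt


def s(lam, m, M):
    """s(lambda) = sqrt((M - lambda)^2 + (lambda - m)^2) / (2 (M - m))."""
    return sqrt((M - lam) ** 2 + (lam - m) ** 2) / (2.0 * (M - m))


def stability_interval(spectrum):
    """Interval (1/2 - s(lambda*), 1/2 + s(lambda*)) for a finite spectrum."""
    m, M = min(spectrum), max(spectrum)
    s_star = min(s(l, m, M) for l in spectrum)
    return 0.5 - s_star, 0.5 + s_star


low, high = stability_interval([1.0, 4.0, 10.0])   # hypothetical spectrum
s_mid = s(5.5, 1.0, 10.0)                          # s at the midpoint (m + M)/2
```

Since $s(\lambda)$ is smallest at the midpoint $(m+M)/2$, where it equals $1/(2\sqrt{2})$, the interval obtained for any spectrum contains the one displayed above.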



Fig. 1. Empirical density of attractors (full line) and $\varphi(p)$, see (22), for $d = 3$ ($m = 1$, $\lambda = 4$, $M = 10$)

Numerical simulations for $\mathcal{H} = \mathbb{R}^3$, with $A$ having eigenvalues $m < \lambda < M$, show
that for any initial density of $x_0$ in $\mathbb{R}^d$ associated with a density of $z_0$ reasonably
spread on the unit sphere, the density of the values of $p$ corresponding to stable
attractors $\nu_p^*$ can be approximated by

*>-c.«.*<*a>>,- *J^L, <*> 

where $C$ is a normalisation constant and $H(\nu_p^*, \lambda)$ is given in Appendix A4. Figure 1
shows the empirical density of attractors (full line) together with $\varphi(p)$ (dashed
line) in the case $m = 1$, $\lambda = 4$, $M = 10$. The support of this density coincides
with the stability interval $\mathcal{I}_s$ given by (21). When $d > 3$, the density of attractors
depends on the initial density of $x_0$.



6. Rates of convergence 

We first state a property showing that different definitions of rates of convergence 
are asymptotically equivalent, see Appendix A5 for the proof. 

Theorem 6. Let $W$ be a bounded positive self-adjoint operator in $\mathcal{H}$, with bounds
$c$ and $C$ such that $0 < c < C < \infty$. Assume that $W$ commutes with $A$ (when
$\mathcal{H} = \mathbb{R}^d$, $W$ is a $d \times d$ positive-definite matrix with minimum and maximum
eigenvalues respectively $c$ and $C$). Define

$$ R_k(W) = \frac{(Wg_{k+1}, g_{k+1})}{(Wg_k, g_k)} $$

if $\|g_k\| \neq 0$ and $R_k(W) = 1$ otherwise. Apply a P-gradient algorithm
initialized at $x_0$, with $\gamma_k$ given by (4), for the minimisation of $f(x)$ given by (1),
with minimum value at $x^*$. Then the limit

$$ R(W, x_0, x^*) = \lim_{n\to\infty} \left( \prod_{k=0}^{n-1} R_k(W) \right)^{1/n} $$

exists for all $x_0, x^*$ in $\mathcal{H}$ and $R(W, x_0, x^*) = R(x_0, x^*)$ does not depend on $W$.
In particular,

$$ R(W, x_0, x^*) = \lim_{n\to\infty} \left( \prod_{k=0}^{n-1} r_k \right)^{1/n} $$

with $r_k$ defined by (9).
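The independence of the asymptotic rate from $W$ can be illustrated numerically. In the Python sketch below, steepest descent ($P(A) = A^{-1}$) is run on a hypothetical diagonal problem, and the geometric mean of the ratios $R_k(W)$ is computed through the telescoping product $\prod_{k=0}^{n-1} R_k(W) = (Wg_n, g_n)/(Wg_0, g_0)$, for three diagonal choices of $W$. The data, the number of iterations and the function names are assumptions of this example.

```python
def dot_w(w, g):
    """(W g, g) for a diagonal W with entries w."""
    return sum(wi * gi * gi for wi, gi in zip(w, g))


def asymptotic_rate(lam, g0, w, n):
    """((W g_n, g_n) / (W g_0, g_0))^(1/n) along steepest descent."""
    g = list(g0)
    for _ in range(n):
        gamma = sum(gi * gi for gi in g) / sum(l * gi * gi for l, gi in zip(lam, g))
        g = [(1.0 - gamma * l) * gi for l, gi in zip(lam, g)]  # g_{k+1} = (I - gamma_k A) g_k
    return (dot_w(w, g) / dot_w(w, g0)) ** (1.0 / n)


lam = [1.0, 4.0, 10.0]                   # hypothetical spectrum
g0 = [1.0, 1.0, 1.0]                     # hypothetical initial gradient
n = 400
rate_I = asymptotic_rate(lam, g0, [1.0, 1.0, 1.0], n)            # W = I
rate_A = asymptotic_rate(lam, g0, lam, n)                        # W = A
rate_Ainv = asymptotic_rate(lam, g0, [1.0 / l for l in lam], n)  # W = A^{-1}
rho = lam[-1] / lam[0]
R_max = ((rho - 1.0) / (rho + 1.0)) ** 2                         # Kantorovich bound
```

For finite `n` the three values differ by at most a factor $(C/c)^{1/n}$, which tends to one as `n` grows.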

From the results of Section 3 we have

$$ R(W, x_0, x^*) = r(p) = \frac{p(1-p)(\rho-1)^2}{[p + \rho(1-p)][(1-p) + \rho p]} $$

for any $W$, where $p$ defines the attractor, see (15), and $\rho = M/m$ is the condition
number of the operator. The function $r(p)$ is symmetric with respect to $1/2$ and
monotonically increasing from $0$ to $1/2$, see Figure 2. The worst asymptotic rate
is thus obtained at $p = 1/2$:

$$ R_{\max} = r(1/2) = \left(\frac{\rho-1}{\rho+1}\right)^2. \tag{23} $$

Note that $\forall k$, $r_k \leq R_{\max}$ since $r_k$ is non-decreasing, see Section 2.4. For a
typical $x_0$ (such that the convergence is not finite, that is, such that $r(p) \neq 0$),
the stability analysis of Section 5 shows that only values of $p$ in $\mathcal{I}_s$ given by (21)
may correspond to stable attractors. The range of possible values of $r(p)$ is thus
$[R_{\min}, R_{\max}]$, where $R_{\max}$, given by (23), is obtained for $p = 1/2$ and

$$ R_{\min} \leq R_{\min}^* = r\!\left(1/2 + 1/[2\sqrt{2}]\right) = \frac{(\rho-1)^2}{(\rho+1)^2 + 4\rho}. $$

Figure 3 presents the range $[R_{\min}^*, R_{\max}]$ as a function of $1/\rho$, the upper curve
corresponding to $R_{\max}$ and the lower to $R_{\min}^*$. The maximum size of the range
is $3 - 2\sqrt{2} \simeq 0.1716$, obtained at $\rho = 1 + 2\sqrt{2} + 2\sqrt{2 + \sqrt{2}} \simeq 7.5239$. These
results confirm the experimental observation that the rate of convergence of the
gradient algorithm is generally close to its worst value $R_{\max}$, see [14]. The same
property is true for any P-gradient algorithm.
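The closed-form rates of this section are easy to evaluate. The following Python sketch (function names chosen for this example) computes $r(p)$, $R_{\max}$ and $R_{\min}^*$, and evaluates the width of the range $[R_{\min}^*, R_{\max}]$ at the value of $\rho$ quoted above.

```python
from math import sqrt


def r(p, rho):
    """Asymptotic rate r(p) for an attractor with parameter p."""
    return p * (1.0 - p) * (rho - 1.0) ** 2 / ((p + rho * (1.0 - p)) * ((1.0 - p) + rho * p))


def R_max(rho):
    """Worst asymptotic rate, attained at p = 1/2."""
    return ((rho - 1.0) / (rho + 1.0)) ** 2


def R_min_star(rho):
    """Rate at p = 1/2 + 1/(2 sqrt 2)."""
    return (rho - 1.0) ** 2 / ((rho + 1.0) ** 2 + 4.0 * rho)


rho_star = 1.0 + 2.0 * sqrt(2.0) + 2.0 * sqrt(2.0 + sqrt(2.0))
width = R_max(rho_star) - R_min_star(rho_star)      # should equal 3 - 2 sqrt(2)
```

Writing $t = \rho + 1/\rho$, the width equals $4(t-2)/[(t+2)(t+6)]$, which is maximal at $t = 2 + 4\sqrt{2}$; this is consistent with the values of $\rho$ and of the maximum stated in the text.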

Remark 2. A similar analysis for $D_k$ defined by (12), which is also non-decreasing,
shows that $D_k \to D(p) = p(1-p)(M-m)^2$ as $k \to \infty$, with $D_k \leq D^* =
D(1/2) = (M-m)^2/4$ for all $k$. Also, for any typical $x_0$ such that $p \in \mathcal{I}_s$ given
by (21), we have $D(p) > D(1/2 + 1/[2\sqrt{2}]) = (M-m)^2/8$.
Fig. 2. $r(p)$ as a function of $p$, for $\rho = 2$ (bottom curve), 4, 8 and 16 (top)



Another quantity of interest is given by

$$ \Delta_N = \log(R_{\max}/R_{\min}^*) / [\log(R_{\max}) \log(R_{\min}^*)]. $$

Indeed, for $N$ large enough, $(Wg_N, g_N)/(Wg_0, g_0) \simeq r(p)^N$, the number $N$ of
iterations required for obtaining a ratio $(Wg_N, g_N)/(Wg_0, g_0) = \epsilon$ ($\epsilon < 1$) is
approximately $\log(\epsilon)/\log[r(p)]$, and $\Delta_N |\log(\epsilon)|$ thus indicates the length of the
interval of possible values for $N$ due to the range of possible values for $p$. Direct
calculation gives $\Delta_N |\log(R_{\max})| < 1/2$ for any $\rho$ and

$$ \Delta_N = \rho/8 - 1/4 + O(1/\rho), \qquad 1/\log(R_{\max}) = -\rho/4 + O(1/\rho) $$

for large $\rho$. Therefore, the number of iterations required by a P-gradient
algorithm to achieve a given precision $\epsilon \ll 1$ varies at most by a factor 2 depending
on the (typical) starting point $x_0$, factors of variation close to 2 being possible
only when $\rho$ is large.
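The quantity $\Delta_N$ and its large-$\rho$ expansion can be checked with a few lines of Python. The function names are chosen for this example, and $R_{\min}^* = (\rho-1)^2/[(\rho+1)^2 + 4\rho]$ is used as the lower end of the range of rates.

```python
from math import log


def R_max(rho):
    return ((rho - 1.0) / (rho + 1.0)) ** 2


def R_min_star(rho):
    return (rho - 1.0) ** 2 / ((rho + 1.0) ** 2 + 4.0 * rho)


def delta_N(rho):
    """log(R_max / R_min*) / [log(R_max) log(R_min*)]."""
    a, b = log(R_max(rho)), log(R_min_star(rho))
    return (a - b) / (a * b)


products = {rho: delta_N(rho) * abs(log(R_max(rho))) for rho in (2.0, 10.0, 100.0, 1000.0)}
approx_error = abs(delta_N(1000.0) - (1000.0 / 8.0 - 0.25))
```

The bound $\Delta_N |\log(R_{\max})| < 1/2$ is equivalent to $R_{\max}^2 < R_{\min}^*$, which can be verified directly from the two closed-form expressions.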

The average value of $R(W, x_0, x^*)$ for $z_0 = z(x_0)$ uniformly distributed on
the unit sphere is the same for any P-gradient algorithm; more generally, the
distribution of $R(W, x_0, x^*)$ associated with a particular distribution of $z_0$ does
not depend on the particular P-gradient algorithm considered. Moreover,
numerical simulations show that the average value of $R(I, x_0, x^*)$ is the same for the
steepest-descent ($P(A) = A^{-1}$) and minimum residues ($P(A) = I$) algorithms
for $x_0$ uniformly distributed on the sphere $\|x_0 - x^*\| = 1$. The small deviations in
average performance between different P-gradient algorithms can only be related
to the fact that a fixed distribution for $x_0$ corresponds to different distributions
for $z(x_0)$.
Fig. 3. Range $[R_{\min}^*, R_{\max}]$ of possible values of the asymptotic rate $r(p)$ as a function of $1/\rho$

Remark 3. It is known that the introduction of a relaxation coefficient $\gamma$, with
$0 < \gamma < 1$, in the steepest-descent algorithm totally changes its behaviour,
see, e.g., Chapter 7 of [15]; the algorithm (2) then becomes
$x_{k+1} = x_k - \gamma [(g_k, g_k)/(Ag_k, g_k)] g_k$. For $\mathcal{H} = \mathbb{R}^d$ and a fixed $A$, depending on the value
of $\gamma$, the renormalized process either converges to periodic orbits (the same for
almost all starting points) or exhibits a chaotic behaviour, with the classical
period-doubling phenomenon in the case $d = 2$. In higher dimensions, repeated
numerical trials show that the process typically no longer converges to the
2-dimensional plane spanned by $(e_1, e_d)$. A detailed analysis for $d = 2$ and
experimental results for $d > 2$ also show that relaxation (with $\gamma$ close to 1) considerably
improves the rate of convergence. Similar results hold more generally for all
P-gradient algorithms, with the iteration (3) transformed into $x_{k+1} = x_k - \gamma \gamma_k g_k$,
with $\gamma$ the (fixed) relaxation coefficient and $\gamma_k$ given by (4). Steepest descent
with random relaxation coefficient $\gamma \in (0,2)$ is considered in [19], avoiding the
two-point attraction and significantly improving the behaviour of ordinary
steepest descent.

Appendix 

Al. Proof of Theorem [H The proof relies on Theorem 0] (Theorem [3] when 
Ji = R d ), which concerns successive transformations applied to a probability 
measure. 

Since A is self-adjoint, its spectrum SSa is a closed subset of the interval 
[m, M] of the real line and m, M € SSa- Let E\ be the spectral family associated 
with A, and define the spectral measure by Vk(d\) = d(E\Zk, Zk), m < X < M. 
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Since (z k , z k ) = f M v k (dX) = 1, v k is a probability measure on the Borel sets of 
(0,oo), with Vf.([m, M]) — 1 V7c. This representation gives 

Hi = (Azk,Zk) = / A^ fe (dA), /i 2 = (A 2 z k ,z k ) = / A 2 v k {dX) 



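where integration is over $[m, M]$ unless otherwise specified. Therefore, for any Borel set $\mathcal{A}$ the transformation (7) gives in terms of $\nu_k$:

\[ \nu_{k+1}(\mathcal{A}) = \frac{\int_{\mathcal{A}} \left[\lambda - \int \lambda'\,\nu_k(d\lambda')\right]^2 \nu_k(d\lambda)}{\int \lambda'^2\,\nu_k(d\lambda') - \left[\int \lambda'\,\nu_k(d\lambda')\right]^2}. \]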

The conditions (14) on $z_0$ are equivalent to $\mathrm{ess\,inf}(\nu_0) = m$ and $\mathrm{ess\,sup}(\nu_0) = M$, see Theorem SI, and the updating rule for $\nu_k$ can be written as (fTS)). Theorem 4 then implies (19), which can be written as: $\forall \epsilon > 0$, $\epsilon < \beta = (M - m)/2$,

\[ (E_{m+\epsilon} z_{2k}, z_{2k}) \to p, \qquad (E_{M-\epsilon} z_{2k}, z_{2k}) \to p, \]
\[ (E_{m+\epsilon} z_{2k+1}, z_{2k+1}) \to 1 - p, \qquad (E_{M-\epsilon} z_{2k+1}, z_{2k+1}) \to 1 - p, \]

as $k \to \infty$, where $p$ depends on $z_0$, $0 < p < 1$. Define $p_{2k} = (E_{m+\beta} z_{2k}, z_{2k})$, $p_{2k+1} = 1 - (E_{m+\beta} z_{2k+1}, z_{2k+1})$, and the angles $\varphi$, $\varphi_n$ by $\cos\varphi = \sqrt{p}$, $\sin\varphi = \sqrt{1-p}$, $\cos\varphi_n = \sqrt{p_n}$, $\sin\varphi_n = \sqrt{1-p_n}$. Also define $s_{2k} = E_{m+\beta} z_{2k}/\cos\varphi_{2k}$, $s_{2k+1} = E_{m+\beta} z_{2k+1}/\sin\varphi_{2k+1}$, $t_{2k} = (z_{2k} - E_{m+\beta} z_{2k})/\sin\varphi_{2k}$, $t_{2k+1} = -(z_{2k+1} - E_{m+\beta} z_{2k+1})/\cos\varphi_{2k+1}$. This gives $p_n \to p$ as $n \to \infty$, $\|s_n\| = \|t_n\| = 1$ $\forall n$, and $z_{2k} = \cos\varphi_{2k}\, s_{2k} + \sin\varphi_{2k}\, t_{2k}$, $z_{2k+1} = \sin\varphi_{2k+1}\, s_{2k+1} - \cos\varphi_{2k+1}\, t_{2k+1}$. Also,



\[ \|As_n - ms_n\|^2 = \int (\lambda - m)^2\, d(E_\lambda s_n, s_n), \]



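which, for $n = 2k$ and any $\epsilon$, $0 < \epsilon < \beta$, gives

\[
\begin{aligned}
\|As_{2k} - ms_{2k}\|^2 &= \int_m^{m+\beta} \frac{(\lambda - m)^2}{p_{2k}}\, d(E_\lambda z_{2k}, z_{2k}) \\
&= \int_m^{m+\epsilon} \frac{(\lambda - m)^2}{p_{2k}}\, d(E_\lambda z_{2k}, z_{2k}) + \int_{m+\epsilon}^{m+\beta} \frac{(\lambda - m)^2}{p_{2k}}\, d(E_\lambda z_{2k}, z_{2k}) \\
&\le \frac{\epsilon^2}{p_{2k}} \int_m^{m+\epsilon} d(E_\lambda z_{2k}, z_{2k}) + \frac{\beta^2}{p_{2k}} \int_{m+\epsilon}^{m+\beta} d(E_\lambda z_{2k}, z_{2k}) \\
&\le \epsilon^2 + \beta^2 \left[ 1 - \frac{1}{p_{2k}} \int_m^{m+\epsilon} d(E_\lambda z_{2k}, z_{2k}) \right].
\end{aligned}
\]

Since $p_{2k} \to p$ and $\int_m^{m+\epsilon} d(E_\lambda z_{2k}, z_{2k}) \to p$ as $k \to \infty$, $\|As_{2k} - ms_{2k}\| \to 0$ as $k \to \infty$. Similarly, $\|As_{2k+1} - ms_{2k+1}\| \to 0$ as $k \to \infty$ and $\|At_n - Mt_n\| \to 0$ as $n \to \infty$. Consider now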

\[ u_n = \cos\vartheta_n\, s_n + \sin\vartheta_n\, t_n, \qquad v_n = -\sin\vartheta_n\, s_n + \cos\vartheta_n\, t_n. \]

Straightforward calculations show that $\vartheta_n = \varphi_n - \varphi$ gives (15) with $\|u_n\| = \|v_n\| = 1$ $\forall n$. Also

\[ \|Au_n - mu_n\| \le |\cos\vartheta_n|\,\|As_n - ms_n\| + |\sin\vartheta_n|\,(M - m), \]
and, since $\|As_n - ms_n\| \to 0$, $\vartheta_n \to 0$ as $n \to \infty$, $\|Au_n - mu_n\| \to 0$ as $n \to \infty$. Similarly, $\|Av_n - Mv_n\| \to 0$ as $n \to \infty$. ■



A2. Proof of Theorem 3. We first prove that the mass of $\nu_k$ tends to concentrate on two eigenvalues only. When $\nu_0$ is non-degenerate, $L_1 > 1$ from Jensen's inequality, and thus, since $(L_k)$ is non-decreasing, see Section 2.4, $L_k \ge L_1 > 1$. Now, from Lagrange's identity $(\sum a_i^2)(\sum b_i^2) = \sum_{i<j} (a_i b_j - a_j b_i)^2 + (\sum a_i b_i)^2$,

2 (A, - A 



Let $i_k$ and $j_k$ denote the indices that achieve $\max_{i<j} [z_k]_i^2 [z_k]_j^2$. We have

i<j 1 3 

and thus 

> ^ = ■ 

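Moreover, $[z_k]^2_{i_k} + [z_k]^2_{j_k} \le 1$ gives

\[ \delta \le [z_k]^2_{i_k} \le 1 - \delta \quad \text{and} \quad \delta \le [z_k]^2_{j_k} \le 1 - \delta. \]
Consider the matrix $\mathbf{M}_k$ given by (10). Its determinant can be written as

\[ \det \mathbf{M}_k = \sum_{i<j<l} [z_k]_i^2 [z_k]_j^2 [z_k]_l^2\, \frac{(\lambda_i - \lambda_j)^2 (\lambda_i - \lambda_l)^2 (\lambda_j - \lambda_l)^2}{\lambda_i \lambda_j \lambda_l} \ge \frac{\delta^2 \delta_\lambda^6}{M^3} \sum_{i \ne i_k, j_k} [z_k]_i^2, \]

where

\[ \delta_\lambda = \min_{i \ne j} |\lambda_i - \lambda_j|. \]

Since $\det \mathbf{M}_k \to 0$ as $k \to \infty$, see ([TT|), we get $\sum_{i \ne i_k, j_k} [z_k]_i^2 \to 0$ as $k \to \infty$. The mass thus tends to concentrate on $\lambda_{i_k}$, $\lambda_{j_k}$.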
Next we prove that $i_k$ and $j_k$ eventually become fixed. From the result above, $\forall \epsilon > 0$, $\exists k_\epsilon$ such that $\sum_{i \ne i_k, j_k} [z_k]_i^2 < \epsilon$, $k > k_\epsilon$.

Consider the updating equation (16). We have for any $i$, $(\mu_1^k - \lambda_i)^2 \le (M - m)^2$. Also, $D_k = \mu_2^k - (\mu_1^k)^2 \ge D_0$, see Section 2.4. This gives for $i \ne i_k, j_k$ and $k > k_\epsilon$

\[ [z_{k+1}]_i^2 < \epsilon\, \frac{(M - m)^2}{D_0}. \]

Taking $\epsilon_1 = \delta D_0/(M - m)^2$ we obtain $[z_{k+1}]_i^2 < \delta$ for $i \ne i_k, j_k$ and $k > k_{\epsilon_1}$. Since $[z_{k+1}]^2_{i_{k+1}} \ge \delta$ and $[z_{k+1}]^2_{j_{k+1}} \ge \delta$, $i \notin \{i_k, j_k\}$ implies $i \notin \{i_{k+1}, j_{k+1}\}$, $k > k_{\epsilon_1}$, and thus $\{i_k, j_k\} = \{i^*, j^*\}$ for $k > k_{\epsilon_1}$.

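We show now that $\{i^*, j^*\} = \{1, d\}$. Assume that $i^* < j^* < d$ (which implies $[z_k]_d^2 \to 0$, $k \to \infty$). We need to show that $(\lambda_d - \mu_1^k)^2 > (\lambda_{j^*} - \mu_1^k)^2$ for $k$ large enough. We have

\[ \mu_1^k = \lambda_{i^*} [z_k]^2_{i^*} + \lambda_{j^*} [z_k]^2_{j^*} + \sum_{i \ne i^*, j^*} \lambda_i [z_k]_i^2 \le \lambda_{i^*} [z_k]^2_{i^*} + \lambda_{j^*} [z_k]^2_{j^*} + \lambda_d \sum_{i \ne i^*, j^*} [z_k]_i^2. \]

Take $\epsilon_2 = \min\{\epsilon_1, \delta\delta_\lambda/\lambda_d\}$. For $k > k_{\epsilon_2}$ we have

\[ \mu_1^k < \lambda_{i^*} [z_k]^2_{i^*} + \lambda_{j^*} [z_k]^2_{j^*} + \lambda_d \epsilon_2 \le \lambda_{i^*}\delta + \lambda_{j^*}(1 - \delta) + \lambda_d \epsilon_2 \le \lambda_{j^*} - \delta\delta_\lambda + \lambda_d \epsilon_2 \le \lambda_{j^*} \]
and thus $(\lambda_d - \mu_1^k)^2 > (\lambda_{j^*} - \mu_1^k)^2$. From (16), this gives for $k > k_{\epsilon_2}$

\[ \frac{[z_{k+1}]_d^2}{[z_k]_d^2} = \frac{(\lambda_d - \mu_1^k)^2}{D_k} > \frac{(\lambda_{j^*} - \mu_1^k)^2}{D_k} = \frac{[z_{k+1}]^2_{j^*}}{[z_k]^2_{j^*}} \]

and thus

\[ \frac{[z_{k+1}]_d^2}{[z_{k+1}]^2_{j^*}} > \frac{[z_k]_d^2}{[z_k]^2_{j^*}}. \]

We arrive at a contradiction since $[z_k]_d^2 \to 0$ and $[z_k]^2_{j^*}$ is bounded from below by $\delta$. Therefore $j^* = d$. Similarly, $i^* = 1$.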

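Finally, let $L$ denote $\lim_{k\to\infty} L_k$, see Section 2.4. There are only two discrete measures with nonzero masses on $\lambda_1$ and $\lambda_d$ and such that $\mu_1\mu_{-1} = L$,

\[ \nu^{(1)} = \begin{Bmatrix} \lambda_1 & \lambda_d \\ p & 1-p \end{Bmatrix} \quad \text{and} \quad \nu^{(2)} = \begin{Bmatrix} \lambda_1 & \lambda_d \\ 1-p & p \end{Bmatrix} \]

with $p$ a root of $p(1-p) = \rho(L-1)/(\rho-1)^2$ and $\rho = M/m$. Direct calculation shows that $\nu_k = \nu^{(1)}$ gives $\nu_{k+1} = \nu^{(2)}$, hence the convergence of $\nu_k$ to the cyclic attractor $\nu^{(1)} \to \nu^{(2)} \to \nu^{(1)} \to \cdots$.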



A3. The proof of Theorem 4 is more technical than that of Theorem 3 and relies on a series of lemmas stated below.

Lemma 1. Let $\nu$ be any probability distribution on $[m,M]$, $0 < m < M < \infty$, with moments $\mu_i = \int \lambda^i\,\nu(d\lambda)$, $i \in \mathbb{Z}$ ($\mu_0 = 1$). Then,

\[ \mu_2 - \mu_1^2 \le D^* = (M - m)^2/4, \qquad (24) \]
\[ \mu_1\mu_{-1} \le L^* = (M + m)^2/(4mM). \qquad (25) \]

Proof. The proof relies on standard results in experimental design theory, see, e.g., [4, 21]. Consider the two linear regression models $\eta_1(\theta, \lambda) = \theta_0 + \theta_1\lambda$ and $\eta_2(\theta, \lambda) = \theta_0/\sqrt{\lambda} + \theta_1\sqrt{\lambda}$, with $\theta_0, \theta_1$ the model parameters and $\lambda$ the design variable, $\lambda \in [m, M]$. $D$-optimum design (approximate theory) aims at determining a probability measure on $[m, M]$ that maximizes the determinant of the information matrix associated with a particular model, here respectively

\[ I_1(\nu) = \begin{pmatrix} 1 & \mu_1 \\ \mu_1 & \mu_2 \end{pmatrix} \quad \text{and} \quad I_2(\nu) = \begin{pmatrix} \mu_{-1} & \mu_0 \\ \mu_0 & \mu_1 \end{pmatrix}. \]

The function $\log\det I(\nu)$ is concave on the set of probability measures on $[m, M]$, and its maximum is unique. The Kiefer-Wolfowitz General Equivalence Theorem [8] gives a characterization of the measure $\nu^*$ that maximizes $\det I_1(\nu) = \mu_2 - \mu_1^2$ and $\det I_2(\nu) = \mu_1\mu_{-1} - 1$. In this case it corresponds to the two-point measure, supported at $m$ and $M$, with both masses equal to $1/2$. Direct calculation gives (24)-(25). One may notice that (25) corresponds to the Kantorovich inequality, see [7] and [12], p. 151. (A full development of this connection is presented in [17].)

■ 
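
The bounds (24) and (25) are easy to probe numerically. The following sketch samples random discrete measures on $[m, M]$ and compares their variance and the product $\mu_1\mu_{-1}$ with $D^*$ and $L^*$; the values of `m` and `M`, the sample sizes and the helper `moments` are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, M = 1.0, 10.0
D_star = (M - m) ** 2 / 4               # bound (24)
L_star = (M + m) ** 2 / (4 * m * M)     # bound (25), Kantorovich

def moments(lams, w):
    """Return (mu_1, mu_2, mu_{-1}) of the discrete measure (lams, w)."""
    return np.dot(w, lams), np.dot(w, lams ** 2), np.dot(w, 1.0 / lams)

worst_D, worst_L = 0.0, 0.0
for _ in range(1000):
    lams = rng.uniform(m, M, size=5)    # random support in [m, M]
    w = rng.dirichlet(np.ones(5))       # random probabilities
    mu1, mu2, mu_m1 = moments(lams, w)
    worst_D = max(worst_D, mu2 - mu1 ** 2)
    worst_L = max(worst_L, mu1 * mu_m1)

# The two-point measure with masses 1/2 at m and M attains both bounds
mu1, mu2, mu_m1 = moments(np.array([m, M]), np.array([0.5, 0.5]))
D_two = mu2 - mu1 ** 2
L_two = mu1 * mu_m1
```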

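Lemma 2. Let $\nu$ be any probability distribution on $[m, M]$, $0 < m < M < \infty$. Assume that there exists an interval $\mathcal{I} \subset [m, M]$, $|\mathcal{I}| \le a$ and $\nu(\mathcal{I}) \ge 1 - \epsilon$, $\epsilon \in [0, 1]$. Then, $\mathrm{Var}(\nu) \le a^2/4 + 2\epsilon M^2$.

Proof. Define $\mu_1 = \int_{[m,M]} \lambda\,\nu(d\lambda)$, $\bar\mu_1 = \int_{\mathcal{I}} \lambda\,\nu(d\lambda)$. Then $\mu_1 = \bar\mu_1 + \int_{[m,M]\setminus\mathcal{I}} \lambda\,\nu(d\lambda)$. Therefore, $\bar\mu_1 \le \mu_1 \le \bar\mu_1 + \epsilon M$. We get

\[
\begin{aligned}
\mathrm{Var}(\nu) = \int (\lambda - \mu_1)^2\,\nu(d\lambda) &\le \int_{\mathcal{I}} (\lambda - \mu_1)^2\,\nu(d\lambda) + (M - m)^2\epsilon \\
&= \int_{\mathcal{I}} (\lambda - \bar\mu_1)^2\,\nu(d\lambda) + (\bar\mu_1 - \mu_1)^2\,\nu(\mathcal{I}) + (M - m)^2\epsilon.
\end{aligned}
\]

Lemma 1 implies $\int_{\mathcal{I}} (\lambda - \bar\mu_1)^2\,\nu(d\lambda) \le a^2/4$ and $(\bar\mu_1 - \mu_1)^2 \le \epsilon^2 M^2$ gives $\mathrm{Var}(\nu) \le a^2/4 + \epsilon^2 M^2 + M^2\epsilon \le a^2/4 + 2\epsilon M^2$.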

■ 

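Lemma 3. Let $\nu$ be any probability distribution on $[m,M]$, $0 < m < M < \infty$. Assume that $\mathrm{Var}(\nu) \le \epsilon$. Then, there exists an interval $\mathcal{I}$ such that $|\mathcal{I}| \le \epsilon^{1/4}$ and $\nu(\mathcal{I}) \ge 1 - 4\sqrt{\epsilon}$.

Proof. Take $\mathcal{I} = [\mu_1 - \epsilon^{1/4}/2,\ \mu_1 + \epsilon^{1/4}/2]$, $\mu_1 = \int \lambda\,\nu(d\lambda)$, and apply the Chebyshev inequality. ■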
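
The interval used in the proof of Lemma 3 can be tested on discrete measures. The sketch below takes $\epsilon$ equal to the variance itself and checks that the interval of width $\epsilon^{1/4}$ centred at the mean carries mass at least $1 - 4\sqrt{\epsilon}$; the measures, including the hypothetical "cluster plus outlier" example, are chosen only for illustration.

```python
import numpy as np

def lemma3_mass(lams, w):
    """Return (variance, mass of the interval of width var**(1/4) around the mean)."""
    mu1 = np.dot(w, lams)
    var = np.dot(w, (lams - mu1) ** 2)
    half_width = var ** 0.25 / 2
    inside = np.abs(lams - mu1) <= half_width
    return var, w[inside].sum()

rng = np.random.default_rng(1)
holds = []
for _ in range(1000):
    lams = rng.uniform(1.0, 10.0, size=6)
    w = rng.dirichlet(np.ones(6))
    var, mass = lemma3_mass(lams, w)
    holds.append(mass >= 1 - 4 * np.sqrt(var) - 1e-12)

# A tight cluster near 5 plus a light outlier at 10, where the bound is non-trivial
lams_c = np.array([4.995, 5.0, 5.005, 10.0])
w_c = np.array([0.3, 0.3999, 0.3, 0.0001])
var_c, mass_c = lemma3_mass(lams_c, w_c)
bound_c = 1 - 4 * np.sqrt(var_c)
```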






Luc Pronzato et al. 



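Lemma 4. Let $\nu$ be any distribution on $[m,M]$, $0 < m < M < \infty$. Define $\mu_i = \int \lambda^i\,\nu(d\lambda)$ and

\[ \mathbf{M} = \begin{pmatrix} \mu_{-1} & \mu_0 & \mu_1 \\ \mu_0 & \mu_1 & \mu_2 \\ \mu_1 & \mu_2 & \mu_3 \end{pmatrix}. \]

Assume that $L = \mu_{-1}\mu_1 > 1$ (which, by Jensen's inequality, holds when $\nu$ is not degenerate at a single point) and $\det\mathbf{M} \le \epsilon$. Then, there exist two intervals $\mathcal{I}_1$ and $\mathcal{I}_2$ such that

(i) $|\mathcal{I}_i| \le \dfrac{(M - m)\,\epsilon^{1/4}}{m^{9/4}(L - 1)^{3/2}}$, $i = 1, 2$, and $\nu(\mathcal{I}_1) + \nu(\mathcal{I}_2) \ge 1 - 4\sqrt{\epsilon}\,M^{3/2}$,

(ii) $\max_{x \in \mathcal{I}_i} |x - 1/\mu_{-1}| \ge \dfrac{3(L - 1)m^2}{4(M - m)}$, $i = 1, 2$, (26)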

. . , 4(L - l) 8 M 8 m 16 

(my /or e < e* = 



[32(L- 1) 3 + M 4 (M-m) 2 ] 2 ' 
and max |a; — y| > my/2(L — 1) . 

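Proof.

(i) Consider the measure $\nu'$ defined by $\nu'(\mathcal{A}) = (1/\mu_{-1}) \int_{\mathcal{A}} (1/\lambda)\,\nu(d\lambda)$ for any Borel set $\mathcal{A} \subset [m,M]$, and denote its moments by $\mu'_i = (1/\mu_{-1}) \int \lambda^{i-1}\,\nu(d\lambda) = \mu_{i-1}/\mu_{-1}$. Note that for any Borel set $\mathcal{A}$

\[ \frac{1}{M\mu_{-1}}\,\nu(\mathcal{A}) \le \nu'(\mathcal{A}) \le \frac{1}{m\mu_{-1}}\,\nu(\mathcal{A}). \]

We have

\[ \mathbf{M}' = \begin{pmatrix} \mu'_0 & \mu'_1 & \mu'_2 \\ \mu'_1 & \mu'_2 & \mu'_3 \\ \mu'_2 & \mu'_3 & \mu'_4 \end{pmatrix} = \mathbf{M}/\mu_{-1}, \]

and thus $\det\mathbf{M}' = \det\mathbf{M}/\mu_{-1}^3$. Also define $D' = \mu'_2 - (\mu'_1)^2$, $a = \sqrt{D'}$, $b = (\mu'_1\mu'_2 - \mu'_3)/\sqrt{D'}$, $c = a\mu'_2 + b\mu'_1 = [(\mu'_2)^2 - \mu'_1\mu'_3]/\sqrt{D'}$ (note that $a > 0$, $b < 0$ and $c < 0$) and $\eta = P(\zeta) = a\zeta^2 + b\zeta - c$, with $\zeta$ having the distribution $\nu'$. Direct calculation gives $E'\{\eta\} = \int P(\zeta)\,\nu'(d\zeta) = 0$ and $\mathrm{Var}'(\eta) = E'\{\eta^2\} - (E'\{\eta\})^2 = \det\mathbf{M}'$, so that $\det\mathbf{M} \le \epsilon$ implies $\mathrm{Var}'(\eta) \le \epsilon' = \epsilon/\mu_{-1}^3$. From Lemma 3, the interval $\mathcal{I} = [-(\epsilon')^{1/4}/2,\ (\epsilon')^{1/4}/2]$ is such that $\Pr\{\eta \in \mathcal{I}\} \ge 1 - 4\sqrt{\epsilon'}$. Also, from the mean-value theorem, there exist $\lambda_1 < \lambda_2$ such that $\lambda_i \in [m, M]$ and $a\lambda_i^2 + b\lambda_i - c = 0$, $i = 1, 2$. Direct calculation gives $P(\mu'_1) = P(1/\mu_{-1}) = a(\mu'_1)^2 + b\mu'_1 - c = -(D')^{3/2}$, and thus

\[ m \le \lambda_1 < 1/\mu_{-1} < \lambda_2 \le M. \]

Take $\beta = (M - m)(\epsilon')^{1/4}/[2(D')^{3/2}]$, we get

\[ P(\lambda_1 + \beta) \le -(\epsilon')^{1/4}/2, \quad P(\lambda_1 - \beta) \ge (\epsilon')^{1/4}/2, \quad P(\lambda_2 + \beta) \ge (\epsilon')^{1/4}/2, \quad P(\lambda_2 - \beta) \le -(\epsilon')^{1/4}/2, \]
and $\nu(\mathcal{I}_1) + \nu(\mathcal{I}_2) \ge 1 - 4\sqrt{\epsilon}\,M^{3/2}$ when $\mathcal{I}_i = [\lambda_i - \beta, \lambda_i + \beta]$, $i = 1, 2$, with $|\mathcal{I}_i| = 2\beta = (M - m)\,\epsilon^{1/4}\mu_{-1}^{9/4}/(L - 1)^{3/2} \le (M - m)\,\epsilon^{1/4}/[m^{9/4}(L - 1)^{3/2}]$.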

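(ii) Define $y_1 = \mu'_1 - \lambda_1$, $y_2 = \lambda_2 - \mu'_1$, so that $\max_{x \in \mathcal{I}_1} |x - \mu'_1| \ge y_1$ and $\max_{x \in \mathcal{I}_2} |x - \mu'_1| \ge y_2$. We have $P(\lambda) = a(\lambda - \lambda_1)(\lambda - \lambda_2)$ and thus $y_1 y_2 = -P(\mu'_1)/a = D'$. Also, $|y_2 - y_1| \le y_1 + y_2 \le M - m$, so that $D' \le y_i(y_i + M - m)$, $i = 1, 2$, and thus

\[ y_i \ge \frac{M - m}{2} \left[ \sqrt{1 + \frac{4D'}{(M - m)^2}} - 1 \right] \ge \frac{D'}{M - m} \left[ 1 - \frac{D'}{(M - m)^2} \right], \quad i = 1, 2. \]

Lemma 1 gives $D' \le (M - m)^2/4$, so that

\[ y_i \ge 3D'/[4(M - m)] \ge 3(L - 1)m^2/[4(M - m)], \quad i = 1, 2. \]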
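(iii) Define $\gamma = \nu'(\mathcal{I}_2)$; part (i) implies $\nu'(\mathcal{I}_1) \ge 1 - 4\sqrt{\epsilon'} - \gamma$, and from Lemma 2

\[ D' \le \frac{(M - m)^2\sqrt{\epsilon'}}{4(D')^3} + 2(4\sqrt{\epsilon'} + \gamma)M^2, \]

which gives

\[ \gamma \ge \frac{D'}{2M^2} - 4\sqrt{\epsilon'} - \frac{(M - m)^2\sqrt{\epsilon'}}{8(D')^3 M^2}, \]

and thus $\gamma \ge D'/(4M^2) \ge (L - 1)m^2/[4M^2]$ for $\epsilon < \epsilon^* \le [4(D')^8]/[(M - m)^2 + 32(D')^3 M^2]^2$, see (27).

Define now $\Delta = \max_{x \in \mathcal{I}_1, y \in \mathcal{I}_2} |x - y|$. Lemma 2 gives

\[ D' \le \frac{\Delta^2}{4} + 8M^2\sqrt{\epsilon'}, \]

which implies $\Delta^2 \ge 4D' - 32M^2\sqrt{\epsilon'}$. Since $\epsilon < \epsilon^*$ implies $\sqrt{\epsilon'} < D'/(16M^2)$, we get $\Delta^2 \ge 2(L - 1)m^2$. ■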

Proof of Theorem 4. The proof follows the same lines as that of Theorem 3 and is divided into four parts. In (i), we construct sequences of intervals $\mathcal{L}_k = [m_k, m_k + \delta]$ and $\mathcal{R}_k = [M_k - \delta, M_k]$ in which the measure $\nu_k$ will tend to concentrate. In (ii) we prove that $\mathcal{R}_k \cap \mathcal{R}_{k+1} \ne \emptyset$ and in (iii) that the sequence $M_k$ is non-decreasing. Finally, the limiting behaviour of $\nu_k$ is derived in (iv).

(i) We have seen in Section 2.4 that $\det\mathbf{M}_k \to 0$ as $k \to \infty$, with $\mathbf{M}_k$ given by (10). Therefore, given $\epsilon$, $\exists K_\epsilon$ such that $\forall k > K_\epsilon$, $\det\mathbf{M}_k < \epsilon$. Define $L_k = \mu_1^k\mu_{-1}^k$ and note that $L_k > 1$ because no $\nu_k$ is degenerate at a single point. Using Lemma 4, for $\epsilon$ small enough, for any $k > K_\epsilon$ there exist two intervals $\mathcal{I}_1^k$, $\mathcal{I}_2^k$, with width at most

\[ \delta = \frac{(M - m)\,\epsilon^{1/4}}{m^{9/4}(L_0 - 1)^{3/2}}, \]

and such that $\nu_k(\mathcal{I}_1^k) + \nu_k(\mathcal{I}_2^k) \ge 1 - 4\sqrt{\epsilon}\,M^{3/2}$, $\nu_k(\mathcal{I}_1^k) \ge m^2(L_k - 1)/(4M^2)$, $\nu_k(\mathcal{I}_2^k) \ge m^2(L_k - 1)/(4M^2)$. Also, $\max_{x \in \mathcal{I}_1^k, y \in \mathcal{I}_2^k} |x - y| \ge m\sqrt{2(L_k - 1)}$. Without any loss of generality, assume that $\mathcal{I}_1^k$ is the interval on the left. Define
$\mathcal{L}(x) = [x, x + \delta]$, $\mathcal{R}(x) = [x - \delta, x]$,

\[ X_L^k = \mathrm{Arg}\max_x \{\nu_k[\mathcal{L}(x)],\ \mathcal{L}(x) \cap \mathcal{I}_1^k \ne \emptyset\}, \]
\[ X_R^k = \mathrm{Arg}\max_x \{\nu_k[\mathcal{R}(x)],\ \mathcal{R}(x) \cap \mathcal{I}_2^k \ne \emptyset\}, \]

and $m_k = \min X_L^k$, $M_k = \max X_R^k$, $\mathcal{L}_k = \mathcal{L}(m_k)$, $\mathcal{R}_k = \mathcal{R}(M_k)$; that is, $M_k$ is the right endpoint of an interval $\mathcal{R}_k$, intersecting $\mathcal{I}_2^k$, with maximum measure, and similarly for $m_k$ and $\mathcal{L}_k$. Note that $\nu_k(\mathcal{L}_k) + \nu_k(\mathcal{R}_k) \ge 1 - 4\sqrt{\epsilon}\,M^{3/2}$, $\nu_k(\mathcal{L}_k) \ge m^2(L_k - 1)/(4M^2)$ and $\nu_k(\mathcal{R}_k) \ge m^2(L_k - 1)/(4M^2)$. The situation is the same for the two sequences of intervals $(\mathcal{L}_k)$ and $(\mathcal{R}_k)$, and we concentrate on $(\mathcal{R}_k)$ in the rest of the proof.

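(ii) We show now that $\mathcal{R}_k \cap \mathcal{R}_{k+1} \ne \emptyset$. Again for $\epsilon$ small enough $\mu_1^k \notin \mathcal{R}_k$ and $\lambda - \mu_1^k \ge M_k - \delta - \mu_1^k$ on $\mathcal{R}_k$ so that

\[ \nu_{k+1}(\mathcal{R}_k) = \int_{\mathcal{R}_k} \frac{(\lambda - \mu_1^k)^2}{D_k}\,\nu_k(d\lambda) \ge \frac{\nu_k(\mathcal{R}_k)}{D_k}\,(M_k - \delta - \mu_1^k)^2 \ge \frac{m^2(L_k - 1)}{4M^2 D^*}\,(M_k - \delta - \mu_1^k)^2 \]

with $D^*$ the maximum possible value of $D_k$, $D^* = (M - m)^2/4$, see Lemma 1. By construction, $\max_{x \in \mathcal{I}_2^k} |x - \mu_1^k| \le M_k + \delta - \mu_1^k$, and thus, from Lemma 4,

\[ M_k - \mu_1^k + \delta \ge \frac{3m^2(L_k - 1)}{4(M - m)} \ge \frac{3m^2(L_0 - 1)}{4(M - m)} = C. \qquad (28) \]

Choosing $\epsilon$ such that $\delta < C/4$ gives $M_k - \delta - \mu_1^k > C/2$ and thus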

\[ \nu_{k+1}(\mathcal{R}_k) \ge \frac{m^2(L_k - 1)}{4M^2 D^*}\,\frac{C^2}{4} \ge \nu_R^* = \frac{9m^6(L_0 - 1)^3}{64M^2(M - m)^4}. \]

Choosing now $\epsilon$ such that $4\sqrt{\epsilon}\,M^{3/2} < \nu_R^*$ we obtain $\mathcal{R}_k \cap \mathcal{R}_{k+1} \ne \emptyset$ for any $k > K_\epsilon$.

(iii) We prove now that the sequence $(M_k)$ is non-decreasing starting at some $K_\epsilon$ for $\epsilon$ small enough. Take $k > K_\epsilon$ and assume that $M_{k+1} = M_k - \beta$, $\beta > 0$. Then note that $\beta \le \delta$ since $\mathcal{R}_k \cap \mathcal{R}_{k+1} \ne \emptyset$ by (ii) above. Consider the difference $\nu_{k+1}(\mathcal{R}_k) - \nu_{k+1}(\mathcal{R}_{k+1}) = \nu_{k+1}([M_k - \beta, M_k]) - \nu_{k+1}([M_k - \delta - \beta, M_k - \delta])$. Assume first that $\nu_{k+1}([M_k - \delta - \beta, M_k - \delta]) = 0$, then $\nu_{k+1}(\mathcal{R}_k) \ge \nu_{k+1}(\mathcal{R}_{k+1})$, which is impossible by construction. We can thus consider the following ratio

\[ \frac{\nu_{k+1}([M_k - \beta, M_k])}{\nu_{k+1}([M_k - \delta - \beta, M_k - \delta])} = \frac{\int_{M_k - \beta}^{M_k} (\lambda - \mu_1^k)^2\,\nu_k(d\lambda)}{\int_{M_k - \delta - \beta}^{M_k - \delta} (\lambda - \mu_1^k)^2\,\nu_k(d\lambda)} \ge \frac{(M_k - \beta - \mu_1^k)^2\,\nu_k([M_k - \beta, M_k])}{(M_k - \delta - \mu_1^k)^2\,\nu_k([M_k - \delta - \beta, M_k - \delta])}. \]

Since $M_k - \delta - \mu_1^k > C - 2\delta > 2\delta$ for $C > 4\delta$, see (28), and $\beta \le \delta$, $(M_k - \beta - \mu_1^k)^2 \ge (M_k - \delta - \mu_1^k)^2$. Also, by construction,

\[ 0 \le \nu_k(\mathcal{R}_k) - \nu_k(\mathcal{R}_{k+1}) = \nu_k([M_k - \beta, M_k]) - \nu_k([M_k - \delta - \beta, M_k - \delta]). \]

This gives

\[ \frac{\nu_{k+1}([M_k - \beta, M_k])}{\nu_{k+1}([M_k - \delta - \beta, M_k - \delta])} \ge 1. \]

Therefore, $\beta > 0$ leads to $\nu_{k+1}(\mathcal{R}_k) \ge \nu_{k+1}(\mathcal{R}_{k+1})$, which is impossible. We thus obtain $M_{k+1} \ge M_k$ for $k > K_\epsilon$.

(iv) Since the sequence $(M_k)$ is non-decreasing and bounded from above (by $M$), it has a limit $M_* \le M$. The same is true for $m_k$, and $m_k \to m_*$ as $k \to \infty$. We have thus proved that for any $\delta$ small enough and any $k$ larger than some $K_\delta$,

\[ \nu_k([M_* - \delta, M_*]) + \nu_k([m_*, m_* + \delta]) \ge 1 - \frac{4M^{3/2} m^{9/2} (L_0 - 1)^3 \delta^2}{(M - m)^2}. \]

Assume that $M_* < M$. This would imply $\nu_k([M - \delta, M]) \to 0$ as $k \to \infty$ for $\delta < M - M_*$. On the other hand,

\[ \frac{\nu_{k+1}([M - \delta, M])}{\nu_{k+1}([M_* - \delta, M_*])} \ge \frac{\nu_k([M - \delta, M])}{\nu_k([M_* - \delta, M_*])}, \]

which leads to a contradiction since $\nu_k([M - \delta, M])/\nu_k([M_* - \delta, M_*])$ is then increasing and $\nu_k([M_* - \delta, M_*])$ is bounded from below. Therefore, $M_* = M$, and similarly $m_* = m$, with, for $\delta$ small enough and any $k$ larger than some $K_\delta$, $\nu_k([m + \delta, M - \delta]) \le 4M^{3/2} m^{9/2} (L_0 - 1)^3 \delta^2/(M - m)^2$. Finally, from Helly's Theorem, see [20], p. 319, from the sequence $(\nu_k)$ we can extract a subsequence $(\nu_{k_i})$ that is weakly convergent, and from the result above the associated limit has necessarily the form $\nu^*_p$, where $\nu^*_p$ is the discrete measure concentrated on the two points $m$, $M$, with $\nu^*_p(m) = p$, $\nu^*_p(M) = 1 - p$. Since $L_{k_i}$ converges to some $L$, $\nu^*_p$ is such that the associated value of $\mu_1\mu_{-1}$ is equal to $L$, which only leaves two possibilities for $p$ (and $1 - p$):




\[ p(1 - p) = \frac{\rho(L - 1)}{(\rho - 1)^2}, \]
where $\rho = M/m$. Applying the transformation $T$, we get $\nu_{k+1} = T(\nu_k)$



A4. Proof of Theorem\^ 

(i) It is straightforward to check that $T^2(\nu^*_p) = \nu^*_p$, $\forall p \in (0, 1)$.

(ii) We assume that $\mathcal{S}_A$ is not reduced to $\{m, M\}$ (otherwise $\mathcal{I}_u = \emptyset$). We have $\nu_{k+2}(d\lambda) = H(\nu_k, \lambda)\,\nu_k(d\lambda)$, with

\[ H(\nu_k, \lambda) = \frac{(\lambda - \mu_1^k)^2 (\lambda - \mu_1^{k+1})^2}{D_k D_{k+1}}, \qquad (29) \]

see (IT5|), with $\mu_1^k$, $D_k$ defined as in Theorem 2J. For $\nu_k = \nu^*_p$, it gives

\[ H(\nu^*_p, \lambda) = \frac{[M(1 - p) + mp - \lambda]^2\,[Mp + m(1 - p) - \lambda]^2}{p^2 (1 - p)^2 (M - m)^4}. \qquad (30) \]

One can then check that for any $p \in \mathcal{I}_u$, $\max_{\lambda \in \mathcal{S}_A} H(\nu^*_p, \lambda) = H(\nu^*_p, \lambda^*) > 1$, with $\lambda^* = \min_{\lambda \in \mathcal{S}_A} s(\lambda)$. Therefore, for any $p \in \mathcal{I}_u$, one can choose $\epsilon$ small enough, such that $d(\nu_k, \nu^*_p) < \epsilon$ implies $\nu_{k+2}([a, b]) \ge K_p\,\nu_k([a, b])$, for some $K_p > 1$ and some $a, b$ such that $m + \epsilon < a < b < M - \epsilon$ and $[a, b] \cap \mathcal{S}_A \ne \emptyset$. For any $\alpha > 0$, $\alpha < 1 - p$, take an initial measure $\nu_0$ putting mass $p$ at $m$, $1 - p - \alpha$ at $M$ and $\alpha$ in the interval $[a, b]$. It satisfies $d(\nu_0, \nu^*_p) \le \alpha$, and, for any $n$, either $d(\nu_{2n}, \nu^*_p) \ge \epsilon$ or $\nu_{2(n+1)}([a, b]) \ge K_p\,\nu_{2n}([a, b])$. The latter case gives $\nu_{2n}([a, b]) \ge 2\epsilon$, and thus $d(\nu_{2n}, \nu^*_p) \ge \epsilon$, as soon as $n > \log(2\epsilon/\alpha)/\log(K_p)$, which shows that $\nu^*_p$ is unstable.
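
Formula (30) is simple to evaluate. The sketch below assumes (30) as written, $H(\nu^*_p, \lambda) = [M(1-p) + mp - \lambda]^2 [Mp + m(1-p) - \lambda]^2 / [p^2(1-p)^2(M-m)^4]$, and checks that it equals 1 at both endpoints; the function name `H_star` and the values of `m`, `M` and `p` are illustrative.

```python
def H_star(p, lam, m, M):
    """Two-step amplification factor H(nu*_p, lambda) of equation (30)."""
    a = M * (1 - p) + m * p          # first moment of nu*_p
    b = M * p + m * (1 - p)          # first moment of nu*_{1-p}
    return (a - lam) ** 2 * (b - lam) ** 2 / (p ** 2 * (1 - p) ** 2 * (M - m) ** 4)

m, M = 1.0, 10.0
mid = 0.5 * (m + M)                  # midpoint of [m, M]

at_m = H_star(0.3, m, m, M)          # mass at m is preserved over two steps
at_M = H_star(0.3, M, m, M)          # mass at M is preserved over two steps
mid_small_p = H_star(0.1, mid, m, M) # amplification of interior mass for p = 0.1
mid_half = H_star(0.5, mid, m, M)    # for p = 1/2 the midpoint is annihilated
```

A value of $H$ above 1 at a point of the spectrum means that mass placed there grows under $T^2$, which is the mechanism used in part (ii); whether a given $p$ belongs to $\mathcal{I}_u$ depends on the actual spectrum $\mathcal{S}_A$, which this sketch does not model.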

(iii) Part (a) concerns the case where a spectral gap is present, with point mass at $m$ and $M$. The proof for the general situation is more technical and is sketched in part (b).

(a) Assume that the measure $\nu_0$ has a spectral gap: $\nu_0 = 0$ on $(m, m + s)$ and $(M - s, M)$ for some $s > 0$. Take $\gamma < s$ and assume that $d(\nu_0, \nu^*_p) < \alpha < \gamma$ with $p \in \mathcal{I}_s$. The arguments go as follows. First we bound $\nu_2\{(m + \gamma, M - \gamma]\}$ by $2K_0\alpha$ for some $K_0 < 1$, then we bound $\nu_2\{(M - \gamma, M]\}$ by $1 - p + K_1\alpha$ for some $K_1 < \infty$. We show that $d(\nu_2, \nu^*_{p_2}) < K_0\alpha$ for some $p_2$ such that $|p_2 - p| < (K_0 + K_1)\alpha$. Stability will then follow by an induction argument.

The maximum value of $H(\nu_0, \lambda)$ for $\lambda$ varying in $[m + \gamma, M - \gamma]$ may be reached for some $\lambda^* \in (\mu_1^0, \mu_1^1)$ or at one of the two points $m + \gamma$, $M - \gamma$. Now, for $\alpha$ small enough $H(\nu_0, \lambda)$ will be close to $H(\nu^*_p, \lambda)$ given by (30), and $p \in \mathcal{I}_s$ implies

\[ \max H(\nu_0, \lambda) < 1. \qquad (31) \]

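Consider the function $H(\nu_0, \lambda)$ at $\lambda = M - \gamma$. We can write

\[ H(\nu_0, M - \gamma) = H(\nu^*_p, M) - \gamma \left. \frac{\partial H(\nu^*_p, \lambda)}{\partial\lambda} \right|_{\lambda = M} + F_H(\nu^*_p; \nu_0, M) + O(\gamma^2), \qquad (32) \]

with $F_H(\nu^*_p; \nu_0, M)$ the directional derivative of $H(\nu, M)$ at $\nu^*_p$ in the direction $\nu_0$,

\[ F_H(\nu^*_p; \nu_0, M) = \lim_{\theta \to 0^+} \frac{H[(1 - \theta)\nu^*_p + \theta\nu_0, M] - H(\nu^*_p, M)}{\theta}. \]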
Define $F_H(\nu^*_p, x, \lambda) = F_H(\nu^*_p; \delta_x, \lambda)$ with $\delta_x$ the delta measure supported at $x$. We have

\[ F_H(\nu^*_p; \nu_0, M) = \int_m^M F_H(\nu^*_p, x, M)\,\nu_0(dx), \]

which we decompose in three parts:

\[ F_H(\nu^*_p; \nu_0, M) = \int_m^{m+\gamma} F_H(\nu^*_p, x, M)\,\nu_0(dx) + \int_{m+\gamma}^{M-\gamma} F_H(\nu^*_p, x, M)\,\nu_0(dx) + \int_{M-\gamma}^{M} F_H(\nu^*_p, x, M)\,\nu_0(dx). \]

Direct calculation gives

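\[ F_H(\nu^*_p, x, M) = \frac{(x - m)^2 (M - x)\,[x - m + (2p - 1)(M - m)]}{p^2 (1 - p)^2 (M - m)^4}, \]

so that $F_H(\nu^*_p, m, M) = F_H(\nu^*_p, M, M) = 0$ and $F_H(\nu^*_p; \nu_0, M) \le F^*\,\nu_0\{(m + \gamma, M - \gamma]\}$ with $F^* = \max_{p \in \mathcal{I}_s, x \in [m, M]} F_H(\nu^*_p, x, M) < \infty$. Also, $d(\nu_0, \nu^*_p) < \alpha$ implies $\nu_0\{(m + \gamma, M - \gamma]\} = \nu_0\{(m + \alpha, M - \alpha]\} < 2\alpha$, so that $F_H(\nu^*_p; \nu_0, M) < 2\alpha F^*$. Now,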



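\[ \left. \frac{\partial H(\nu^*_p, \lambda)}{\partial\lambda} \right|_{\lambda = M} = \frac{2}{p(1 - p)(M - m)}, \]
which, together with (32), gives for $\gamma$ small enough

\[ H(\nu_0, M - \gamma) < 1 + 2\alpha F^* - \frac{\gamma}{p(1 - p)(M - m)} \]

and thus

\[ H(\nu_0, M - \gamma) < 1 - \frac{\gamma}{2p(1 - p)(M - m)} \]

for $\alpha < \gamma/[4F^* p(1 - p)(M - m)]$.

The situation is similar at $m + \gamma$. Together with (31) this implies for $\alpha$ small enough

\[ \max_{\lambda \in \mathcal{S}_A \cap [m + \gamma, M - \gamma]} H(\nu_0, \lambda) < K_0 < 1 \]

and therefore,

\[ \nu_2\{(m + \gamma, M - \gamma]\} < 2K_0\alpha \qquad (33) \]

with $K_0 < 1$ not depending on $\alpha$.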

Consider now the interval $(M - \gamma, M]$. We have

\[ \nu_2\{(M - \gamma, M]\} = \nu_2(M) = H(\nu_0, M)\,\nu_0\{(M - \gamma, M]\} = H(\nu_0, M)\,\nu_0(M), \]
with $d(\nu_0, \nu^*_p) < \alpha$ implying $\nu_0(M) < 1 - p + \alpha$, and

\[ H(\nu_0, M) = H(\nu^*_p, M) + F_H(\nu^*_p; \nu_0, M) + O(\alpha^2) < 1 + 2\alpha F^* + O(\alpha^2). \]
This gives for $\alpha$ small enough

\[ \nu_2\{(M - \gamma, M]\} < 1 - p + K_1\alpha \]

for some $K_1 < \infty$. Similarly, $\nu_2\{[m, m + \gamma]\} < p + K_1\alpha$.

Define $p_0 = p$, $p_2 = [\nu_2(m) - \nu_2(M) + 1]/2$, $\alpha_0 = \alpha$. We obtain

\[ -K_0\alpha \le \nu_2(m) - p_2 \le 0, \qquad -K_0\alpha \le \nu_2(M) - (1 - p_2) \le 0, \]

which together with (33) implies

\[ d(\nu_2, \nu^*_{p_2}) < \alpha_2 = K_0\alpha. \]

Moreover, $|p_2 - p_0| < (K_0 + K_1)\alpha$.

For $\alpha$ small enough, $p_2 \in \mathcal{I}_s$ and we can then repeat the same arguments. This gives for any $n$

\[ d(\nu_{2n}, \nu^*_{p_{2n}}) < \alpha_{2n} = K_0^n\alpha \]

with

\[ |p_{2n} - p| < (K_0 + K_1)\,\alpha\,\frac{1 - K_0^n}{1 - K_0} < \frac{(K_0 + K_1)\,\alpha}{1 - K_0} \]

and $p_{2n} \in \mathcal{I}_s$ for $\alpha$ small enough. For any $p \in \mathcal{I}_s$ and any $\epsilon > 0$, taking $\nu_0$ such that $d(\nu_0, \nu^*_p) < \alpha$ with $\alpha$ small enough thus implies $d(\nu_{2n}, \nu^*_p) < \epsilon$ for any $n$, and $\nu^*_p$ is thus stable.

(b) Consider now the general situation. The proof follows the same lines as in case (a), but more technicalities are required since we need to consider measures of intervals of the form $[m, m + \gamma]$ and $(M - \gamma, M]$, with $\gamma$ decreasing in a suitable way as the number of iterations of the mapping $T^2$ increases.

Assume that

\[ \nu_{2k}\{(m + \gamma_{2k}, M - \gamma_{2k}]\} < 2\alpha_{2k}, \]
\[ \nu_{2k}\{[m, m + \gamma_{2k}]\} < p_{2k} + \alpha_{2k}, \]
\[ \nu_{2k}\{(M - \gamma_{2k}, M]\} < 1 - p_{2k} + \alpha_{2k}, \]

for some $p_{2k} \in \mathcal{I}_s$ and some $\alpha_{2k}$, $\gamma_{2k}$. Note that it implies $d(\nu_{2k}, \nu^*_{p_{2k}}) < \gamma_{2k}$ and that for $k = 0$, $\alpha_0$, $\gamma_0$ can be chosen arbitrarily small, with $d(\nu_0, \nu^*_p) < \alpha_0$ for some $p \in \mathcal{I}_s$.

Consider one application of the mapping $T^2$ at a generic iteration $k$. We can write $H(\nu_{2k}, M) = H(\nu^*_{p_{2k}}, M) + F_H(\nu^*_{p_{2k}}; \nu_{2k}, M) + O(\gamma_{2k}^2)$ with

\[ F_H(\nu^*_{p_{2k}}; \nu_{2k}, M) = \int_m^{m + \gamma_{2k}} F_H(\nu^*_{p_{2k}}, x, M)\,\nu_{2k}(dx) + \int_{m + \gamma_{2k}}^{M - \gamma_{2k}} F_H(\nu^*_{p_{2k}}, x, M)\,\nu_{2k}(dx) + \int_{M - \gamma_{2k}}^{M} F_H(\nu^*_{p_{2k}}, x, M)\,\nu_{2k}(dx). \]

The first integral term is of the order $O(\gamma_{2k}^2)$ (since $F_H(\nu^*_{p_{2k}}, m, M) = 0$ and $dF_H(\nu^*_{p_{2k}}, z, M)/dz|_{z=m} = 0$), the second is bounded by $2\alpha_{2k}F^* + O(\gamma_{2k}^2)$, as in case (a). For the third term, for which $x$ is close to $M$, we can use the linear approximation

\[ F_H(\nu^*_{p_{2k}}, x, M) = (x - M) \left. \frac{dF_H(\nu^*_{p_{2k}}, z, M)}{dz} \right|_{z = M} + O(\gamma_{2k}^2) = \frac{-2(x - M)}{p_{2k}(1 - p_{2k})^2 (M - m)} + O(\gamma_{2k}^2), \]

which gives

\[ \int_{M - \gamma_{2k}}^{M} F_H(\nu^*_{p_{2k}}, x, M)\,\nu_{2k}(dx) = \frac{2 I_{2k}(M)}{p_{2k}(1 - p_{2k})^2 (M - m)} + O(\gamma_{2k}^2), \]

where $I_{2k}(M) = \int_0^{\gamma_{2k}} z\,\nu'_{2k}(dz)$ with $\nu'_{2k}$ the measure obtained after applying the transformation $x \mapsto z = M - x$. We have thus obtained

\[ H(\nu_{2k}, M) < 1 + 2\alpha_{2k}F^* + \frac{2 I_{2k}(M)}{p_{2k}(1 - p_{2k})^2 (M - m)} + O(\gamma_{2k}^2). \qquad (34) \]

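Consider now the behaviour of $I_{2k}(M)$ as $k$ increases. We assume that $\nu_{2k}$ remains in some neighbourhood $V(p)$ of $\nu^*_p$, which we shall be able to guarantee afterwards. Define $A_{2k}(M) = I_{2k}(M)\,[\int_0^{\gamma_{2k}} \nu'_{2k}(dz)]^{-1}$. It satisfies $I_{2k}(M) \le A_{2k}(M) \le \gamma_{2k}$. Also, $\gamma_{2(k+1)} \le \gamma_{2k}$ implies

\[ A_{2(k+1)}(M) = \frac{\int_0^{\gamma_{2(k+1)}} z\,H(\nu_{2k}, M - z)\,\nu'_{2k}(dz)}{\int_0^{\gamma_{2(k+1)}} H(\nu_{2k}, M - z)\,\nu'_{2k}(dz)} \le \frac{\int_0^{\gamma_{2k}} z\,H(\nu_{2k}, M - z)\,\nu'_{2k}(dz)}{\int_0^{\gamma_{2k}} H(\nu_{2k}, M - z)\,\nu'_{2k}(dz)}, \]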

and, since H(v 2k ,M — z) decreases for z close to zero, 



A 2( k+i){M) < rj — 



Io 2h Z H< H(V2^M) ) V 2k ( dz ) 



We can bound the speed of decrease of $H(\nu_{2k}, M - z)$: $H(\nu, M - z)/H(\nu, M) < 1 - az$ for some $a > 0$, any $z$ in $[0, \gamma_0]$ and any $\nu \in V(p)$. This gives

g z(l - azy 2k (dz) 

A2(fe+1)(M)< JT^W ■ 

Repeating the same arguments we get for any n > 0, 

r™z{l-azYv> 2k {dz) 



A 2{k+n) (M) < A 2[k+n) {M) = 



with A 2 i k+n \ (M) decreasing with n. Direct calculation gives X)^Lo ^2(fc+n) (-^0 
1/a, and therefore I 2k {M) < A 2k (M) = o(l/k). 
Similarly to case (a), we can write

\[ H(\nu_{2k}, M - \gamma_{2(k+1)}) = H(\nu_{2k}, M) - \frac{2\gamma_{2(k+1)}}{p_{2k}(1 - p_{2k})(M - m)} + O(\gamma_{2k}^2), \]

with $H(\nu_{2k}, M)$ bounded by (34). Assume that $\gamma_{2k}$ is such that $A_{2k}(M) = o(\gamma_{2k})$ and $\alpha_{2k} = o(\gamma_{2k})$. We obtain for $p_{2k}$ close enough to $p$

\[ H(\nu_{2k}, M - \gamma_{2(k+1)}) < \beta_{2(k+1)} = 1 - \frac{\gamma_{2(k+1)}}{p(1 - p)(M - m)}. \qquad (35) \]

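We thus get the following bounds on the measure of subintervals of interest at the next iteration:

\[ \nu_{2(k+1)}\{(m + \gamma_{2(k+1)}, M - \gamma_{2(k+1)}]\} < 2\max\{\beta_{2(k+1)}, K_0\}\,\alpha_{2k} \qquad (36) \]

where $K_0 = \max_{\nu_{2k} \in V(p)} \max_{\lambda \in \mathcal{S}_A \cap (\mu_1^{2k}, \mu_1^{2k+1})} H(\nu_{2k}, \lambda)$, and $K_0 < 1$ for $p$ in $\mathcal{I}_s$ and $V(p)$ small enough, see part (a);

\[
\begin{aligned}
\nu_{2(k+1)}\{(M - \gamma_{2(k+1)}, M]\} &\le \nu_{2(k+1)}\{(M - \gamma_{2k}, M]\} \\
&\le H(\nu_{2k}, M)\,\nu_{2k}\{(M - \gamma_{2k}, M]\} \\
&\le \left[ 1 + 2\alpha_{2k}F^* + \frac{2A_{2k}(M)}{p_{2k}(1 - p_{2k})^2 (M - m)} + O(\gamma_{2k}^2) \right] \nu_{2k}\{(M - \gamma_{2k}, M]\} \\
&\le \nu_{2k}\{(M - \gamma_{2k}, M]\} + B\alpha_{2k} + CA_{2k}(M) + D\gamma_{2k}^2
\end{aligned}
\]

for some $B, C, D < \infty$. Similarly, we obtain

\[ \nu_{2(k+1)}\{[m, m + \gamma_{2(k+1)}]\} \le \nu_{2k}\{[m, m + \gamma_{2k}]\} + B\alpha_{2k} + CA_{2k}(m) + D\gamma_{2k}^2 \]
where $A_{2k}(m)$ is defined similarly to $A_{2k}(M)$. Define $p_{2(k+1)}$ as

\[ p_{2(k+1)} = \frac{\nu_{2(k+1)}\{[m, m + \gamma_{2(k+1)}]\} - \nu_{2(k+1)}\{(M - \gamma_{2(k+1)}, M]\} + 1}{2}; \]

it gives

\[ 0 \le p_{2(k+1)} - \nu_{2(k+1)}\{[m, m + \gamma_{2(k+1)}]\} \le \max\{\beta_{2(k+1)}, K_0\}\,\alpha_{2k}, \]

\[ 0 \le 1 - p_{2(k+1)} - \nu_{2(k+1)}\{(M - \gamma_{2(k+1)}, M]\} \le \max\{\beta_{2(k+1)}, K_0\}\,\alpha_{2k}. \]

Together with (36) it implies $d(\nu_{2(k+1)}, \nu^*_{p_{2(k+1)}}) < \gamma_{2(k+1)} < \gamma_{2k}$, with

\[ |p_{2(k+1)} - p_{2k}| < \Delta_{2k} = [B + 1 + \max\{\beta_{2(k+1)}, K_0\}]\,\alpha_{2k} + CA'_{2k} + D\gamma_{2k}^2, \]

where $A'_{2k} = \max\{A_{2k}(m), A_{2k}(M)\}$ and $\sum_k A'_{2k} < \infty$.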

Define a 2 (k+i) = ntax{/3 2 (fc+i) , -f^o}a2fe and take 72/c = with g < 1, so 
that A' 2k — o{"f 2k ). From the definition of /3 2 (fc+i), see (|35|) . a 2 fe < 00 and 
«2fc = o(7 2 fe). Since J2 k A' 2k < 00, taking q > 1/2 in the definition ofj 2k ensures 
J2k^ 2k < 00 ■ We can repeat the same argument, and d{y 2 ^k+ n ) 1 v p 2{k+n) ) < 
72(fc+«) which tends to zero as n increases, with \p2(k+n) ~P2k\ remaining finite. 
$\nu_{2(k+n)}$ thus remains in some neighbourhood $V(p)$ of $\nu^*_p$ for any $n$, and $V(p)$ can be made arbitrarily small by choosing $\alpha_0$ and $\gamma_0$ small enough. ■



A5. Proof of Theorem^ Assume that $x_0$ is such that for some $k \ge 0$, $\|g_{k+1}\| = 0$ with $\|g_i\| > 0$ for all $i \le k$ (that is, $x_{k+1} = x^*$ and $x_i \ne x^*$ for $i \le k$). This implies $R_k(W) = 0$ for any $W$, and therefore $R(W, x_0, x^*) = R(x_0, x^*) = 0$. Assume now that $\|g_k\| > 0$ for all $k$. Consider

\[ V_n = \left[ \prod_{k=0}^{n-1} R_k(W) \right]^{1/n} = \left[ \prod_{k=0}^{n-1} \frac{(Wg_{k+1}, g_{k+1})}{(Wg_k, g_k)} \right]^{1/n} = \left[ \frac{(Wg_n, g_n)}{(Wg_0, g_0)} \right]^{1/n}. \]

We have,

\[ \forall z \in \mathcal{H}, \quad c\|z\|^2 \le (Wz, z) \le C\|z\|^2, \]

and thus

\[ (c/C)^{1/n} \left[ \frac{(g_n, g_n)}{(g_0, g_0)} \right]^{1/n} \le V_n \le (C/c)^{1/n} \left[ \frac{(g_n, g_n)}{(g_0, g_0)} \right]^{1/n}. \]

Since $(c/C)^{1/n} \to 1$ and $(C/c)^{1/n} \to 1$ as $n \to \infty$, $\liminf_{n\to\infty} V_n$ and $\limsup_{n\to\infty} V_n$ do not depend on $W$. Take $W = P(A)$; it gives $R_k(W) = r_k = 1 - 1/L_k$, see ©, which is non-decreasing, and thus $\lim_{n\to\infty} V_n = 1 - 1/L$ for any $W$. ■
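
The independence of the rate from $W$ can be observed numerically. The sketch below runs ordinary steepest descent on a small quadratic and evaluates $V_n = [(Wg_n, g_n)/(Wg_0, g_0)]^{1/n}$ for three choices of $W$. The matrix, the starting point, the function name `rate_estimate`, and the identification of $W = A^{-1}$ with the choice $P(A)$ for steepest descent are assumptions made for this illustration.

```python
import numpy as np

def rate_estimate(A, W, x0, n):
    """Return V_n = [(W g_n, g_n) / (W g_0, g_0)]^(1/n) for steepest descent
    on f(x) = 0.5 * x^T A x, where g_k = A x_k is the gradient."""
    g0 = A @ np.asarray(x0, dtype=float)
    g = g0.copy()
    for _ in range(n):
        step = (g @ g) / (g @ A @ g)   # exact line-search step length
        g = g - step * (A @ g)         # g_{k+1} = g_k - step * A g_k
    return ((g @ W @ g) / (g0 @ W @ g0)) ** (1.0 / n)

A = np.diag([1.0, 2.0, 5.0, 10.0])     # illustrative spectrum, m = 1, M = 10
x0 = np.ones(4)
n = 100
V_identity = rate_estimate(A, np.eye(4), x0, n)
V_A = rate_estimate(A, A, x0, n)
V_Ainv = rate_estimate(A, np.linalg.inv(A), x0, n)
worst_case = ((10.0 - 1.0) / (10.0 + 1.0)) ** 2   # 1 - 1/L* with L* from Lemma 1
```

The three estimates differ by at most the factor $(C/c)^{1/n}$ appearing in the proof, and with $W = A^{-1}$ each ratio is bounded by $1 - 1/L^*$ through the Kantorovich inequality (25).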